Data sets and session info
All SNP-containing loci/all individuals grouped by estuary.
## /// GENIND OBJECT /////////
##
## // 316 individuals; 4,122 loci; 44,170 alleles; size: 61.4 Mb
##
## // Basic content
## @tab: 316 x 44170 matrix of allele counts
## @loc.n.all: number of alleles per locus (range: 2-80)
## @loc.fac: locus factor for the 44170 columns of @tab
## @all.names: list of allele names for each locus
## @ploidy: ploidy of each individual (range: 2-2)
## @type: codom
## @call: write_genind(data = tidy)
##
## // Optional content
## @pop: population of each individual (group size range: 2-62)
## @strata: a data frame with 32 columns ( LIB_ID, WELL, SAMPLE_ID, OCEAN, REGION1, REGION, ... )
All SNP-containing loci/individuals from northern Gulf of Mexico grouped by sampled estuary.
## /// GENIND OBJECT /////////
##
## // 248 individuals; 4,122 loci; 44,170 alleles; size: 49.9 Mb
##
## // Basic content
## @tab: 248 x 44170 matrix of allele counts
## @loc.n.all: number of alleles per locus (range: 2-80)
## @loc.fac: locus factor for the 44170 columns of @tab
## @all.names: list of allele names for each locus
## @ploidy: ploidy of each individual (range: 2-2)
## @type: codom
## @call: .local(x = x, i = i, j = j, pop = ..1, treatOther = ..2, quiet = ..3,
## drop = drop)
##
## // Optional content
## @pop: population of each individual (group size range: 248-248)
## @strata: a data frame with 32 columns ( LIB_ID, WELL, SAMPLE_ID, OCEAN, REGION1, REGION, ... )
All SNP-containing loci/individuals from US Atlantic grouped by sampled estuary.
## /// GENIND OBJECT /////////
##
## // 68 individuals; 4,122 loci; 44,170 alleles; size: 19.5 Mb
##
## // Basic content
## @tab: 68 x 44170 matrix of allele counts
## @loc.n.all: number of alleles per locus (range: 2-80)
## @loc.fac: locus factor for the 44170 columns of @tab
## @all.names: list of allele names for each locus
## @ploidy: ploidy of each individual (range: 2-2)
## @type: codom
## @call: .local(x = x, i = i, j = j, pop = ..1, treatOther = ..2, quiet = ..3,
## drop = drop)
##
## // Optional content
## @pop: population of each individual (group size range: 68-68)
## @strata: a data frame with 32 columns ( LIB_ID, WELL, SAMPLE_ID, OCEAN, REGION1, REGION, ... )
Tissues (fin clips) were taken from southern flounder young of the year (YOY), juveniles, and adults in estuaries throughout the Gulf from San Antonio Bay, TX to Apalachicola Bay, FL and in the Atlantic from St. John’s River, FL to Winyah Bay, SC. Additional fin clips from adults were obtained offshore of Texas and Louisiana which were assigned to the nearest estuary where possible. Samples are grouped by estuary within ocean basins (GULF, ATL).
## OGR data source with driver: ESRI Shapefile
## Source: "/home/soleary/FLOUNDER/SFL_PopGen/data/BASEMAPS", layer: "ne_10m_land"
## with 1 features
## It has 2 fields
## OGR data source with driver: ESRI Shapefile
## Source: "/home/soleary/FLOUNDER/SFL_PopGen/data/BASEMAPS", layer: "ne_10m_admin_1_states_provinces_lines"
## with 10118 features
## It has 21 fields
Sample distribution of southern flounder young-of-the-year, juveniles, and adults sampled in estuaries throughout the Gulf of Mexico and Western Atlantic. Abbreviations listed in table below.
Southern flounder are assumed to be approximately 1 year at 25 cm (red dashed line), 2 years at 40 cm (blue line). Males are generally smaller, females are considered to be mature at approx 40 cm.
Distribution of total length [mm] across all sample locations. Dashed lines indicates approximate size at one (red) and two (blue) years old.
Sample size per estuary.
| ESTUARY | Abbreviation | n |
|---|---|---|
| ApalachicolaBay | AP | 44 |
| BaratariaBay | BB | 39 |
| CharlestonHarbor | CH | 18 |
| EastMississippiSound | EMS | 5 |
| GalvestonBay | GB | 2 |
| MatagordaBay | MAT | 3 |
| MobileBay | MB | 62 |
| SabineLake | SL | 24 |
| SanAntonioBay | SA | 23 |
| SanteeRivers | SR | 4 |
| StHelenSound | SHS | 4 |
| StJohnsRiver | SJR | 20 |
| Stono-NorthEdistoRivers | ST | 4 |
| WestMississippiSound | WMS | 34 |
| WinyahBay | WB | 18 |
Sample size per ocean basin.
| OCEAN | n |
|---|---|
| GULF | 248 |
| ATL | 68 |
Sample size per estuary and age class. YOY defined as fished caught at 25cm or less, juveniles 25 - 40cm, and adults as > 40cm.
| ESTUARY | ADULT | JUVENILE | YOY | NA |
|---|---|---|---|---|
| ApalachicolaBay | 1 | 3 | 40 | NA |
| BaratariaBay | 7 | 32 | NA | NA |
| CharlestonHarbor | 7 | 9 | 2 | NA |
| EastMississippiSound | 3 | 2 | NA | NA |
| GalvestonBay | NA | 2 | NA | NA |
| MatagordaBay | NA | NA | 3 | NA |
| MobileBay | 48 | 11 | NA | 3 |
| SabineLake | 2 | 19 | 3 | NA |
| SanAntonioBay | 3 | 10 | 10 | NA |
| SanteeRivers | 2 | 2 | NA | NA |
| StHelenSound | 2 | 2 | NA | NA |
| StJohnsRiver | NA | NA | 20 | NA |
| Stono-NorthEdistoRivers | 1 | NA | 3 | NA |
| WestMississippiSound | 9 | 22 | 2 | 1 |
| WinyahBay | 7 | 8 | 3 | NA |
Year sampled per estuary for YOY (defined as < 25cm).
| ESTUARY | 2013 | 2014 | 2015 | 2016 | NA |
|---|---|---|---|---|---|
| CharlestonHarbor | 1 | NA | 1 | NA | NA |
| WinyahBay | 3 | NA | NA | NA | NA |
| Stono-NorthEdistoRivers | NA | 3 | NA | NA | NA |
| WestMississippiSound | NA | NA | 2 | NA | NA |
| ApalachicolaBay | NA | NA | NA | 40 | NA |
| MatagordaBay | NA | NA | NA | 3 | NA |
| SabineLake | NA | NA | NA | 3 | NA |
| SanAntonioBay | NA | NA | NA | 9 | 1 |
| StJohnsRiver | NA | NA | NA | 20 | NA |
See Genotyping.Rmd for details/scripts for SNP calling, filtering, and haplotyping.
DNA was extracted using Mag-Bind Blood and Tissue DNA kits (Omega Bio-Tek). Double digest restriction-site associated DNA (ddRAD) libraries were constructed using a modified protocol (Portnoy et al. 2015Portnoy, D. S., J. B. Puritz, C. M. Hollenbeck, J. Gelsleichter, D. Chapman, and J. R. Gold. 2015. “Selection and sex-biased dispersal in a coastal shark: The influence of philopatry on adaptive variation.” Molecular Ecology 24 (23): 5877–85. https://doi.org/10.1111/mec.13441.) and sequenced on four separate lanes of an Illumina HiSeq 2500.
Raw sequences were demultiplexed using process_radtags (Catchen et al. 2013Catchen, Julian, Paul A. Hohenlohe, Susan Bassham, Angel Amores, and William A. Cresko. 2013. “Stacks: An analysis tool set for population genomics.” Molecular Ecology 22 (11): 3124–40. https://doi.org/10.1111/mec.12354.). Quality trimming, read mapping, and SNP calling were performed using the dDocent pipeline (Puritz, Hollenbeck, and Gold 2014Puritz, Jonathan B, Christopher M Hollenbeck, and John R Gold. 2014. “dDocent : a RADseq, variant-calling pipeline designed for population genomics of non-model organisms.” PeerJ 2: e431. https://doi.org/10.7717/peerj.431.) and a reduced-representation reference genome previously produced for southern flounder (approximately 2 – 5% of the genome).
Raw SNPs were rigorously filtered using VCFtools (Danecek et al. 2011Danecek, Petr, Adam Auton, Goncalo Abecasis, Cornelis A. Albers, Eric Banks, Mark A. DePristo, Robert E. Handsaker, et al. 2011. “The variant call format and VCFtools.” Bioinformatics 27 (15): 2156–8. https://doi.org/10.1093/bioinformatics/btr330.) and custom scripts following (???). Final filtering thresholds were a sequence quality of 20, a minimum genotype call rate per locus of 90%, a minor allele count of 3, a minimum genotype depth of 3, and a mean minimum depth of 15. Individuals with > 85% missing data were removed. SNPs were further filtered based on allele balance, quality/depth ratio, mapping quality ratio of reference/alternate alleles, properly paired status, strand representation, and maximum depth. Finally, rad_haplotyper (Willis et al. 2017Willis, Stuart C., Christopher M. Hollenbeck, Jonathan B. Puritz, John R. Gold, and David S. Portnoy. 2017. “Haplotyping RAD loci: an efficient method to filter paralogs and account for physical linkage.” Molecular Ecology Resources, February. https://doi.org/10.1111/1755-0998.12647.) was used to collapse SNPs on the same contig into haplotypes producing a final data set consisting of SNP-containing loci (hereafter loci) for data analysis.
Presence of \(F_{ST}\)-outlier loci was assessed using two methods, the FDIST method (Beaumont and Nichols 1996Beaumont, M. A., and R. A. Nichols. 1996. “Evaluating loci for use in the genetic analysis of population structure.” Proceedings of the Royal Society B: Biological Sciences 263 (1377): 1619–26. https://doi.org/10.1098/rspb.1996.0237.) as implemented in Arlequin (Excoffier and Lischer 2010Excoffier, Laurent, and Heidi E. L. Lischer. 2010. “Arlequin suite ver 3.5: A new series of programs to perform population genetics analyses under Linux and Windows.” Molecular Ecology Resources 10 (3): 564–67. https://doi.org/10.1111/j.1755-0998.2010.02847.x.), and a Bayesian modeling method implemented in BayeScan (Foll and Gaggiotti 2008Foll, Matthieu, and Oscar Gaggiotti. 2008. “A genome-scan method to identify selected loci appropriate for both dominant and codominant markers: A Bayesian perspective.” Genetics 180 (2): 977–93. https://doi.org/10.1534/genetics.108.092221.). For both methods, loci with significantly higher \(F_{ST}\) than expected under a neutral model are considered as loci putatively under directional selection. Given low background \(F_{ST}\) values typical of marine fish data sets (Knutsen et al. 2011Knutsen, H., E. M. Olsen, P. E. Jorde, S. H. Espeland, C. André, and N. C. Stenseth. 2011. “Are low but statistically significant levels of genetic differentiation in marine fishes ’biologically meaningful’? A case study of coastal Atlantic cod.” Molecular Ecology 20 (4): 768–83. https://doi.org/10.1111/j.1365-294X.2010.04979.x.), an assessment for loci putatively under balancing selection (\(F_{ST}\) significantly lower than expected) was not conducted.
Analysis in Arlequin was based on 20,000 coalescent simulations using a strict island model. To account for multiple testing, p-values were corrected assuming a false discovery rate (FDR) < 0.05. BayeScan runs consisted of 25 pilot runs of 5,000 iterations, followed by a total of 550,000 iterations (burn-in of 50,000 iterations, 10,000 samples with thinning interval of 50). A FDR of 0.05 was used as the threshold for outlier detection.
For both methods, \(F_{ST}\)-outlier analysis was run for all individuals grouped by basin, and separately for Gulf and Atlantic individuals grouped by estuary. The distribution of loci flagged during \(F_{ST}\)-outlier analysis loci across linkage groups (chromosomes) was assessed using a previously established linkage map.
Bayesian maximum likelihood (multinomial-Dirichlet model)
BayeScan uses differences in allele frequencies between populations to identify \(F_{ST}\)-outlier loci. The null model (i.e. distribution of \(F_{ST}\) for neutral loci) is generated based on the multinomial-Dirichlet model which assumes allele frequencies of sub-populations are correlated through a common migrant gene pool from which they differ to varying degrees; this difference is measured by a subpopulation-specific \(F_{ST}\) coefficient. Therefore, this model can incorporate ecologically realistic scenarios and is a robust method even when effective sizes and migration rates differ among subpopulations.
Selection is introduced by decomposing population-specific \(F_{ST}\) coefficients into population-specific component (beta), shared by all loci, and locus-specific component (alpha), shared by all poulations using logistic regresstion. If a locus-specific component is necessary to explain observed pattern of diversity (i.e. alpha significantly different from 0), departure from neutrality is inferred.
BayeScan calculates posterior probability for model including selection using bayes factors. Multiple testing is needed to incorporate identifying loci as under selection by chance. The posterior odds are calculated to determine outlier loci. Posterior odds are calculated as the ratio of posterior probabilities indicating how much more likely the model with selection is compared to netural model. Posterior probabilities can be used to directly control FDR (expected proportion of false positives amont outlier loci). The q-value of each locus indicates the minimum FDR at which this locus may become significant, e.g. q-value > 0.05 means that 5% of corresponding outlier markers are expected to be false positives (5% threshold more stringent than corresponding p-value in classic statistics).
Parameters used for BayeScan run
## [1] "There are 4122 loci."
## [2] ""
## [3] "There are 2 populations."
## [4] ""
## [5] "Burn in: 50000"
## [6] "Thining interval: 50"
## [7] "Sample size: 10000"
## [8] "Resulting total number of iterations: 550000"
## [9] "Nb of pilot runs: 25"
## [10] "Length of each pilot run: 5000"
## [11] ""
Use q-value to determine if a locus is a good candidate for locus being under the influence of selection.
Comparison of log10(qvalue) and Fst per locus.
A total of 30 loci have a q-value < 0.05.
FDIST method implemented in Arlequin
Coalescent simulations based on island model are used to generate the null distribution for neutral loci based on a given number of populations.
Fst-heterozygosity distribution for samples grouped by ocean basin. Red open circles indicate loci significantly outside the simulated null distribution.
A total of 329 loci were flagged as outlier using a corrected p-value < 0.05.
Genome-wide distribution of Fst values; loci flagged as outlier by Arlequin are indicated in red.
Fst estimated for 4122 loci, 1387 loci were mapped on the linkage map.
Bayesian maximum likelihood (multinomial-Dirichlet model)
Parameters used for BayeScan run
## [1] "There are 4107 loci."
## [2] ""
## [3] "There are 5 populations."
## [4] ""
## [5] "Burn in: 50000"
## [6] "Thining interval: 50"
## [7] "Sample size: 10000"
## [8] "Resulting total number of iterations: 550000"
## [9] "Nb of pilot runs: 25"
## [10] "Length of each pilot run: 5000"
## [11] ""
Use q-value to determine if a locus is a good candidate for locus being under the influence of selection.
Comparison of log10(qvalue) and Fst per locus.
no significant outlier detected
FDIST method (Arlequin)
Fst-heterozygosity distribution.
no significant outlier detected
Bayesian maximum likelihood (multinomial-Dirichlet model)
no significant outlier detected
FDIST method (arlequin)
no significant outlier detected
Loci were subdivided into two data sets: outlier loci (consensus loci identified by both outlier detection methods as putatively under directional selection) and neutral loci (all remaining loci). Only samples from estuaries with ≥18 individuals were used in the analysis.
Hierarchical, locus-by-locus analysis of molecular variance (AMOVA), as implemented in Arlequin, was used to test homogeneity of each component both between oceans and among estuaries within ocean basins. Divergence in neutral loci within each ocean basin was explored further by single-level AMOVA. For each AMOVA, significance for each component of variation was assessed by permuting individuals between groups 10,000 times.
# AMOVA (estuaries within ocean basins) & pairwise Fst neutral
./scr/arlecore3522_64bit data/POPGEN/SFL.estuaries.genotypes_arlequin.arp scr/Fstatistics_neutral_all.ars
# AMOVA (estuaries within ocean basins) & pairwise Fst outlier
./scr/arlecore3522_64bit data/POPGEN/SFL.estuaries.genotypes_arlequin.arp scr/Fstatistics_outlier_all.ars
# single level AMOVA Gulf (neutral)
./scr/arlecore3522_64bit data/POPGEN/SFL.estuaries.genotypes_arlequin.arp scr/Fstatistics_neutral_Gulf.ars
# single level AMOVA Atlantic (neutral)
./scr/arlecore3522_64bit data/POPGEN/SFL.estuaries.genotypes_arlequin.arp scr/Fstatistics_neutral_Atl.ars
Copy results files (*xml) into /results/ directory and rename; delete results directory. Extract AMOVA and pairwise \(F_{ST}\) results from xml file and format into tab-delimted file.
Locus-by-locus AMOVA using neutral loci indicating variance partitioning and significance of each component.
| Source of variation | Percentage Variation | Average F-statistic over all loci | p-value |
|---|---|---|---|
| among oceans | 4.13479 | 0.04135 | <0.0001 |
| among estuaries w/in oceans | 0.15511 | 0.00162 | 0.00931 |
| among individuals w/in estuaries | 95.71010 | 0.04290 | <0.0001 |
Locus-by-locus AMOVA using outlier loci indicating variance partitioning and significance of each component.
| Source of variation | Percentage Variation | Average F-statistic over all loci | p-value |
|---|---|---|---|
| among oceans | 32.75179 | 0.32752 | <0.0001 |
| among estuaries w/in oceans | 0.13319 | 0.00198 | 0.60198 |
| among individuals w/in estuaries | 67.11502 | 0.32885 | <0.0001 |
Compare global \(F_{ST}\) (single level AMOVA results) for individuals within Gulf and Atlantic basins.
| Source of variation | Percentage Variation | Average F-statistic over all loci | p-value |
|---|---|---|---|
| among estuaries | 0.14665 | 0.00147 | 0.027 |
| within estuaries | 99.85335 | NA | NA |
| Source of variation | Percentage Variation | Average F-statistic over all loci | p-value |
|---|---|---|---|
| among estuaries | 0.26015 | 0.0026 | 0.098 |
| within estuaries | 99.73985 | NA | NA |
Pairwise estimates of \(F_{ST}\) for each component was generated in Arlequin as a post hoc test for homogeneity between estuaries. Significance was determined by permuting individuals between groups 10,000 times. P-values were corrected for multiple comparisons assuming an FDR of 0.05.
Comparison of pairwise Fst (above the diagonal) and level of significance (below the diagonal) between all pairs of estuaries in the Gulf and Atlantic using neutral loci only. * indicates significant Fst-values.
| FST | SA | SL | BB | WMS | MB | AP | SJR | CH | WB |
|---|---|---|---|---|---|---|---|---|---|
| SA | - | 0.002 | 0.00109 | 0.00141 | 0.0011 | 0.00169* | 0.0466* | 0.04237* | 0.04437* |
| SL | 0.03297 | - | 0.00102 | 0.00105 | 0.00042 | 0.00106 | 0.0452* | 0.04114* | 0.04279* |
| BB | 0.21018 | 0.32442 | - | 0.00067 | 0.00076 | 0.00079 | 0.04627* | 0.04213* | 0.04391* |
| WMS | 0.06445 | 0.37571 | 0.38046 | - | 0.00045 | 0.00056 | 0.04629* | 0.04207* | 0.04416* |
| MB | 0.12533 | 0.90229 | 0.08762 | 0.66033 | - | 0.00062 | 0.04509* | 0.04059* | 0.04284* |
| AP | 0.00406 | 0.28472 | 0.14801 | 0.56796 | 0.20097 | - | 0.04573* | 0.0414* | 0.04325* |
| SJR | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | - | 0.00104 | 0.00114 |
| CH | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.44362 | - | 0.00114 |
| WB | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.604305 | 0.14692 | - |
Pairwise Fst, p-value, and adjusted p-value for all pairs of estuaries in the Gulf and Atlantic using neutral loci only.
| pop1 | pop2 | Fst | p_val | p_adj |
|---|---|---|---|---|
| SL | MB | 0.00042 | 0.902290 | 0.9022900 |
| WMS | MB | 0.00045 | 0.660330 | 0.6791966 |
| SJR | WB | 0.00114 | 0.604305 | 0.6398524 |
| WMS | AP | 0.00056 | 0.567960 | 0.6195927 |
| SJR | CH | 0.00104 | 0.443620 | 0.4990725 |
| SL | WMS | 0.00105 | 0.375710 | 0.4418245 |
| BB | WMS | 0.00067 | 0.380460 | 0.4418245 |
| SL | BB | 0.00102 | 0.324420 | 0.4027283 |
| SL | AP | 0.00106 | 0.284720 | 0.3660686 |
| SA | BB | 0.00109 | 0.210180 | 0.2802400 |
| MB | AP | 0.00062 | 0.200970 | 0.2782662 |
| BB | AP | 0.00079 | 0.148010 | 0.2131344 |
| CH | WB | 0.00114 | 0.146920 | 0.2131344 |
| SA | MB | 0.00110 | 0.125330 | 0.1961687 |
| BB | MB | 0.00076 | 0.087620 | 0.1433782 |
| SA | WMS | 0.00141 | 0.064450 | 0.1104857 |
| SA | SL | 0.00200 | 0.032970 | 0.0593460 |
| SA | AP | 0.00169 | 0.004060 | 0.0076926 |
| SA | SJR | 0.04660 | 0.000000 | 0.0000000 |
| SA | CH | 0.04237 | 0.000000 | 0.0000000 |
| SA | WB | 0.04437 | 0.000000 | 0.0000000 |
| SL | SJR | 0.04520 | 0.000000 | 0.0000000 |
| SL | CH | 0.04114 | 0.000000 | 0.0000000 |
| SL | WB | 0.04279 | 0.000000 | 0.0000000 |
| BB | SJR | 0.04627 | 0.000000 | 0.0000000 |
| BB | CH | 0.04213 | 0.000000 | 0.0000000 |
| BB | WB | 0.04391 | 0.000000 | 0.0000000 |
| WMS | SJR | 0.04629 | 0.000000 | 0.0000000 |
| WMS | CH | 0.04207 | 0.000000 | 0.0000000 |
| WMS | WB | 0.04416 | 0.000000 | 0.0000000 |
| MB | SJR | 0.04509 | 0.000000 | 0.0000000 |
| MB | CH | 0.04059 | 0.000000 | 0.0000000 |
| MB | WB | 0.04284 | 0.000000 | 0.0000000 |
| AP | SJR | 0.04573 | 0.000000 | 0.0000000 |
| AP | CH | 0.04140 | 0.000000 | 0.0000000 |
| AP | WB | 0.04325 | 0.000000 | 0.0000000 |
Comparison of pairwise Fst (above the diagonal) and level of significance (below the diagonal) between all pairs of estuaries in the Gulf and Atlantic using outlier loci only. * indicates significant Fst-values.
| FST | SA | SL | BB | WMS | MB | AP | SJR | CH | WB |
|---|---|---|---|---|---|---|---|---|---|
| SA | - | -0.00112 | 0.00192 | 0.00468 | -0.00028 | 0.00215 | 0.33993* | 0.3419* | 0.30559* |
| SL | 0.87476 | - | 0.00062 | 0.00502 | 0.00177 | 0.00145 | 0.34198* | 0.3445* | 0.30632* |
| BB | 0.46273 | 0.68815 | - | 0.00088 | 0.00172 | 0.00268 | 0.34807* | 0.34365* | 0.31019* |
| WMS | 0.15038 | 0.12741 | 0.59954 | - | 0.0019 | 0.00357 | 0.32652* | 0.32446* | 0.29069* |
| MB | 0.77893 | 0.36264 | 0.26572 | 0.26195 | - | 0.00152 | 0.34004* | 0.33477* | 0.30326* |
| AP | 0.39481 | 0.5244 | 0.19206 | 0.11633 | 0.28552 | - | 0.34818* | 0.34395* | 0.31226* |
| SJR | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | - | -0.00775 | 0.00085 |
| CH | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.99287 | - | -0.00037 |
| WB | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.60271 | 0.62142 | - |
Pairwise Fst, p-value, and adjusted p-value for all pairs of estuaries in the Gulf and Atlantic using outlier loci only.
| pop1 | pop2 | Fst | p_val | p_adj |
|---|---|---|---|---|
| SJR | CH | -0.00775 | 0.99287 | 0.9928700 |
| SA | SL | -0.00112 | 0.87476 | 0.8997531 |
| SA | MB | -0.00028 | 0.77893 | 0.8247494 |
| SL | BB | 0.00062 | 0.68815 | 0.7507091 |
| BB | WMS | 0.00088 | 0.59954 | 0.6990975 |
| SJR | WB | 0.00085 | 0.60271 | 0.6990975 |
| CH | WB | -0.00037 | 0.62142 | 0.6990975 |
| SL | AP | 0.00145 | 0.52440 | 0.6509793 |
| SA | BB | 0.00192 | 0.46273 | 0.5949386 |
| SA | AP | 0.00215 | 0.39481 | 0.5264133 |
| SL | MB | 0.00177 | 0.36264 | 0.5021169 |
| MB | AP | 0.00152 | 0.28552 | 0.4111488 |
| BB | MB | 0.00172 | 0.26572 | 0.3985800 |
| WMS | MB | 0.00190 | 0.26195 | 0.3985800 |
| BB | AP | 0.00268 | 0.19206 | 0.3142800 |
| SA | WMS | 0.00468 | 0.15038 | 0.2577943 |
| SL | WMS | 0.00502 | 0.12741 | 0.2293380 |
| WMS | AP | 0.00357 | 0.11633 | 0.2204147 |
| SA | SJR | 0.33993 | 0.00000 | 0.0000000 |
| SA | CH | 0.34190 | 0.00000 | 0.0000000 |
| SA | WB | 0.30559 | 0.00000 | 0.0000000 |
| SL | SJR | 0.34198 | 0.00000 | 0.0000000 |
| SL | CH | 0.34450 | 0.00000 | 0.0000000 |
| SL | WB | 0.30632 | 0.00000 | 0.0000000 |
| BB | SJR | 0.34807 | 0.00000 | 0.0000000 |
| BB | CH | 0.34365 | 0.00000 | 0.0000000 |
| BB | WB | 0.31019 | 0.00000 | 0.0000000 |
| WMS | SJR | 0.32652 | 0.00000 | 0.0000000 |
| WMS | CH | 0.32446 | 0.00000 | 0.0000000 |
| WMS | WB | 0.29069 | 0.00000 | 0.0000000 |
| MB | SJR | 0.34004 | 0.00000 | 0.0000000 |
| MB | CH | 0.33477 | 0.00000 | 0.0000000 |
| MB | WB | 0.30326 | 0.00000 | 0.0000000 |
| AP | SJR | 0.34818 | 0.00000 | 0.0000000 |
| AP | CH | 0.34395 | 0.00000 | 0.0000000 |
| AP | WB | 0.31226 | 0.00000 | 0.0000000 |
Genomic diversity was estimated for each estuary (N > 18), using Nei’s gene diversity (Nei 1973Nei, Masatoshi. 1973. “Analysis of Gene Diversity in Subdivided Populations.” Proceedings of the National Academy of Sciences 70 (12): 3321–3. https://doi.org/10.1073/pnas.70.12.3321.), rarefied allele counts, and evenness. The latter is a measure of the distribution of allele frequencies and was estimated as the ratio of the Stoddart & Taylor index (diversity weighted for more abundant alleles) and the Shannon-Wiener index (diversity weighted for rarer alleles), as implemented in poppr (Kamvar, Tabima, and Grünwald 2014Kamvar, Zhian N., Javier F. Tabima, and Niklaus J. Grünwald. 2014. “Poppr : an R package for genetic analysis of populations with clonal, partially clonal, and/or sexual reproduction.” PeerJ 2 (March): e281. https://doi.org/10.7717/peerj.281.). For each measure of diversity, a Friedman’s rank sum test was used to test homogeneity among estuaries. A Wilcoxon signed-rank test was used post hoc to test for pairwise differences between estuaries; p-values were corrected for multiple comparisons, using a sequential Bonferroni approach (Rice 1989). The number of fixed loci was documented for each region and estuary, using rarefied allele counts.
Compare patterns of gene diversity across loci within estuaries.
Distribution of Nei’s gene diversity (Hs) per locus among individuals grouped by estuary.
Test for significant differences among estuaries using Friedman’s test.
# remove loci with Na values
rm <- read_delim("results/gendiv.locstats", delim = "\t") %>%
select(GRP, LOCUS, Hs) %>%
filter(GRP %in% levels_estuaries3) %>%
mutate(GRP = ordered(GRP, levels = levels_estuaries3)) %>%
filter(is.na(Hs))
temp <- het %>%
filter(!LOCUS %in% rm$LOCUS)
friedman.test(Hs ~ GRP | LOCUS, data = temp)
##
## Friedman rank sum test
##
## data: Hs and GRP and LOCUS
## Friedman chi-squared = 158.21, df = 8, p-value < 0.00000000000000022
Test for significant pairwise differences between estuaries using Wilcoxon signed rank test; tests symmetry of numeric repeated measurements (statistic per locus) in block design.
# remove loci with Na values
rm <- het %>%
filter(is.na(Hs))
het <- het %>%
filter(!LOCUS %in% rm$LOCUS)
# groups to compare
comp <- as.character(unique(het$GRP))
# pairs of comparisons
# pairs <- combn(comp, 2, simplify = FALSE)
pairs <- expand.grid(comp, comp) %>%
filter(!Var1 == Var2) %>%
rownames_to_column("PAIR") %>%
split(.$PAIR) %>%
purrr::map(function(x){
x %>%
select(-PAIR) %>%
gather(key = temp, value = GRP, 1:2) %>%
select(-temp)
})
# empty data frame for results
results <- setNames(data.frame(matrix(ncol = 4, nrow = 0)),
c("pop1", "pop2", "stat", "p.value")) %>%
mutate(pop1 = as.character(pop1),
pop2 = as.character(pop2),
stat = as.numeric(stat),
p.value = as.numeric(p.value))
n <- as.numeric(length(pairs))
# loop over pairs
for(p in 1:length(pairs)){
pair <- pairs[[p]]$GRP
temp <- het %>%
filter(GRP %in% pair) %>%
mutate(GRP = ordered(GRP, levels = pair),
LOCUS = as.factor(LOCUS)) %>%
droplevels()
wilcox <- wilcoxsign_test(Hs ~ GRP | LOCUS,
data = temp,
zero.method = "Pratt")
df <- data.frame("pop1" = pair[1],
"pop2" = pair[2],
"stat" = as.numeric(wilcox@statistic@teststatistic),
"p-value" = as.numeric(pvalue(wilcox)))
results <- bind_rows(results, df)
}
results <- results %>%
mutate(p_adj = p.adjust(p.value, "BH")) %>%
mutate(pop1 = ordered(pop1, levels = levels_estuaries3),
pop2 = ordered(pop2, levels = levels_estuaries3)) %>%
mutate(OCEAN1 = ifelse(pop1 %in% c("WinyahBay", "CharlestonHarbor", "StJohnsRiver"), "ATL", "GULF"),
OCEAN2 = ifelse(pop2 %in% c("WinyahBay", "CharlestonHarbor", "StJohnsRiver"), "ATL", "GULF"),
p.value = round(p.value, digits = 5),
p_adj = round(p_adj, digits = 5)) %>%
unite(COMP, OCEAN1, OCEAN2, sep = "-")
Compare significance of pairwise differences.
Level of significance of pairwise comparisons of distribution of Nei’s gene diversity across loci for individuals samples in a given estuary.
Test statistic, p-value and adjusted p-value of pairwise comparisons of Nei’s gene diversity.
| pop1 | pop2 | stat | p.value | p_adj | COMP |
|---|---|---|---|---|---|
| WinyahBay | StJohnsRiver | 0.00 | 0.99938 | 0.99938 | ATL-ATL |
| ApalachicolaBay | MobileBay | 0.07 | 0.94052 | 0.96739 | GULF-GULF |
| SanAntonioBay | ApalachicolaBay | 0.26 | 0.79390 | 0.84060 | GULF-GULF |
| ApalachicolaBay | BaratariaBay | 0.57 | 0.56670 | 0.61822 | GULF-GULF |
| MobileBay | BaratariaBay | 0.66 | 0.50644 | 0.56974 | GULF-GULF |
| BaratariaBay | WestMississippiSound | 0.88 | 0.37688 | 0.43767 | GULF-GULF |
| SanAntonioBay | BaratariaBay | 0.99 | 0.31997 | 0.38941 | GULF-GULF |
| SanAntonioBay | MobileBay | 0.99 | 0.32450 | 0.38941 | GULF-GULF |
| CharlestonHarbor | WinyahBay | 1.11 | 0.26723 | 0.34358 | ATL-ATL |
| ApalachicolaBay | WestMississippiSound | 1.46 | 0.14337 | 0.19116 | GULF-GULF |
| MobileBay | WestMississippiSound | 1.55 | 0.12196 | 0.16886 | GULF-GULF |
| CharlestonHarbor | StJohnsRiver | 1.73 | 0.08397 | 0.12092 | ATL-ATL |
| SanAntonioBay | WestMississippiSound | 2.02 | 0.04308 | 0.06462 | GULF-GULF |
| WestMississippiSound | CharlestonHarbor | 2.73 | 0.00640 | 0.01002 | GULF-ATL |
| WestMississippiSound | StJohnsRiver | 2.84 | 0.00455 | 0.00745 | GULF-ATL |
| SabineLake | SanAntonioBay | 2.92 | 0.00352 | 0.00603 | GULF-GULF |
| BaratariaBay | CharlestonHarbor | 2.98 | 0.00291 | 0.00524 | GULF-ATL |
| MobileBay | CharlestonHarbor | 3.48 | 0.00050 | 0.00095 | GULF-ATL |
| ApalachicolaBay | CharlestonHarbor | 3.48 | 0.00050 | 0.00095 | GULF-ATL |
| SanAntonioBay | CharlestonHarbor | 3.61 | 0.00030 | 0.00064 | GULF-ATL |
| BaratariaBay | StJohnsRiver | 3.80 | 0.00015 | 0.00033 | GULF-ATL |
| SabineLake | ApalachicolaBay | 3.81 | 0.00014 | 0.00033 | GULF-GULF |
| WestMississippiSound | WinyahBay | 3.95 | 0.00008 | 0.00020 | GULF-ATL |
| SabineLake | MobileBay | 3.98 | 0.00007 | 0.00019 | GULF-GULF |
| ApalachicolaBay | StJohnsRiver | 4.12 | 0.00004 | 0.00012 | GULF-ATL |
| MobileBay | StJohnsRiver | 4.17 | 0.00003 | 0.00010 | GULF-ATL |
| BaratariaBay | WinyahBay | 4.28 | 0.00002 | 0.00007 | GULF-ATL |
| SabineLake | BaratariaBay | 4.34 | 0.00001 | 0.00006 | GULF-GULF |
| SanAntonioBay | StJohnsRiver | 4.35 | 0.00001 | 0.00006 | GULF-ATL |
| SanAntonioBay | WinyahBay | 4.94 | 0.00000 | 0.00000 | GULF-ATL |
| ApalachicolaBay | WinyahBay | 5.11 | 0.00000 | 0.00000 | GULF-ATL |
| MobileBay | WinyahBay | 5.19 | 0.00000 | 0.00000 | GULF-ATL |
| SabineLake | WestMississippiSound | 5.24 | 0.00000 | 0.00000 | GULF-GULF |
| SabineLake | CharlestonHarbor | 5.68 | 0.00000 | 0.00000 | GULF-ATL |
| SabineLake | StJohnsRiver | 5.93 | 0.00000 | 0.00000 | GULF-ATL |
| SabineLake | WinyahBay | 6.35 | 0.00000 | 0.00000 | GULF-ATL |
# overall
setPop(gen) <- ~OVERALL
dat <- hierfstat:::.genind2hierfstat(gen)
ar <- allelic.richness(dat,
diploid = TRUE)
ar <- as.data.frame(ar$Ar) %>%
rownames_to_column("LOCUS") %>%
rename(ALL = `1`)
# by ocean
setPop(gen) <- ~OCEAN
dat <- hierfstat:::.genind2hierfstat(gen)
df <- allelic.richness(dat,
diploid = TRUE)
df <- as.data.frame(df$Ar) %>%
rownames_to_column("LOCUS") %>%
rename(GULF = `1`,
ATL = `2`)
ar <- left_join(ar, df)
# by region
setPop(gen) <- ~REGION
dat <- hierfstat:::.genind2hierfstat(gen)
df <- allelic.richness(dat,
diploid = TRUE)
df <- as.data.frame(df$Ar) %>%
rownames_to_column("LOCUS") %>%
rename(CGULF = `1`,
EGULF = `2`,
SWATL = `3`,
MATL = `4`,
WGULF = `5`)
ar <- left_join(ar, df)
# by estuary
setPop(gen) <- ~ESTUARY
gen_est <- seppop(gen)
temp <- gen_est[c("SanAntonioBay", "SabineLake",
"BaratariaBay", "WestMississippiSound", "MobileBay",
"ApalachicolaBay",
"StJohnsRiver", "CharlestonHarbor", "WinyahBay")]
temp <- repool(temp)
setPop(temp) <- ~ESTUARY
dat <- hierfstat:::.genind2hierfstat(temp)
df <- allelic.richness(dat,
diploid = TRUE)
df <- as.data.frame(df$Ar) %>%
rownames_to_column("LOCUS") %>%
rename(SanAntonioBay = `1`,
SabineLake = `2`,
BaratariaBay = `3`,
WestMississippiSound = `4`,
MobileBay = `5`,
ApalachicolaBay = `6`,
StJohnsRiver = `7`,
CharlestonHarbor = `8`,
WinyahBay = `9`)
ar <- left_join(ar, df)
write_delim(ar, "results/rarefied.allelecount", delim = "\t")
Calculate allelic richness corrected for sample size using rarefaction.
Distribution of rarefied allele counts per estuary.
Test for significant differences among estuaries using Friedman’s test.
##
## Friedman rank sum test
##
## data: AR and pop and LOCUS
## Friedman chi-squared = 1536.5, df = 8, p-value < 0.00000000000000022
Test for significant pairwise differences between estuaries using Wilcoxon signed rank test to identify patterns of significant differentiation.
Level of significance of pairwise comparisons of distribution of rarefied allelecounts across loci for individuals samples in a given estuary.
Test statistic, p-value, and adjusted p-value of pairwise comparisons of rarefied allele count.
| pop1 | pop2 | stat | p.value | p_adj | COMP |
|---|---|---|---|---|---|
| ApalachicolaBay | MobileBay | 0.05 | 0.95652 | 0.95652 | GULF-GULF |
| StJohnsRiver | WinyahBay | 0.27 | 0.78655 | 0.80903 | ATL-ATL |
| SanAntonioBay | WestMississippiSound | 0.67 | 0.49982 | 0.52922 | GULF-GULF |
| BaratariaBay | SanAntonioBay | 0.81 | 0.41988 | 0.45805 | GULF-GULF |
| MobileBay | BaratariaBay | 0.88 | 0.37780 | 0.42502 | GULF-GULF |
| ApalachicolaBay | BaratariaBay | 1.26 | 0.20901 | 0.24272 | GULF-GULF |
| MobileBay | SanAntonioBay | 1.92 | 0.05451 | 0.06541 | GULF-GULF |
| ApalachicolaBay | SanAntonioBay | 1.97 | 0.04914 | 0.06100 | GULF-GULF |
| BaratariaBay | WestMississippiSound | 2.07 | 0.03887 | 0.04998 | GULF-GULF |
| ApalachicolaBay | WestMississippiSound | 2.85 | 0.00436 | 0.00581 | GULF-GULF |
| CharlestonHarbor | WinyahBay | 3.03 | 0.00245 | 0.00339 | ATL-ATL |
| CharlestonHarbor | StJohnsRiver | 3.09 | 0.00197 | 0.00284 | ATL-ATL |
| MobileBay | WestMississippiSound | 3.21 | 0.00133 | 0.00199 | GULF-GULF |
| SabineLake | ApalachicolaBay | 3.85 | 0.00012 | 0.00018 | GULF-GULF |
| SabineLake | MobileBay | 4.43 | 0.00001 | 0.00002 | GULF-GULF |
| SabineLake | BaratariaBay | 4.92 | 0.00000 | 0.00000 | GULF-GULF |
| SabineLake | SanAntonioBay | 5.10 | 0.00000 | 0.00000 | GULF-GULF |
| SabineLake | WestMississippiSound | 6.27 | 0.00000 | 0.00000 | GULF-GULF |
| WestMississippiSound | CharlestonHarbor | 17.79 | 0.00000 | 0.00000 | GULF-ATL |
| SanAntonioBay | CharlestonHarbor | 17.85 | 0.00000 | 0.00000 | GULF-ATL |
| BaratariaBay | CharlestonHarbor | 18.82 | 0.00000 | 0.00000 | GULF-ATL |
| SanAntonioBay | StJohnsRiver | 18.92 | 0.00000 | 0.00000 | GULF-ATL |
| WestMississippiSound | StJohnsRiver | 19.04 | 0.00000 | 0.00000 | GULF-ATL |
| SanAntonioBay | WinyahBay | 19.90 | 0.00000 | 0.00000 | GULF-ATL |
| ApalachicolaBay | CharlestonHarbor | 20.03 | 0.00000 | 0.00000 | GULF-ATL |
| WestMississippiSound | WinyahBay | 20.08 | 0.00000 | 0.00000 | GULF-ATL |
| MobileBay | CharlestonHarbor | 20.26 | 0.00000 | 0.00000 | GULF-ATL |
| BaratariaBay | StJohnsRiver | 20.45 | 0.00000 | 0.00000 | GULF-ATL |
| SabineLake | CharlestonHarbor | 20.81 | 0.00000 | 0.00000 | GULF-ATL |
| ApalachicolaBay | StJohnsRiver | 21.34 | 0.00000 | 0.00000 | GULF-ATL |
| BaratariaBay | WinyahBay | 21.40 | 0.00000 | 0.00000 | GULF-ATL |
| MobileBay | StJohnsRiver | 21.79 | 0.00000 | 0.00000 | GULF-ATL |
| SabineLake | StJohnsRiver | 22.17 | 0.00000 | 0.00000 | GULF-ATL |
| ApalachicolaBay | WinyahBay | 22.33 | 0.00000 | 0.00000 | GULF-ATL |
| MobileBay | WinyahBay | 22.75 | 0.00000 | 0.00000 | GULF-ATL |
| SabineLake | WinyahBay | 22.82 | 0.00000 | 0.00000 | GULF-ATL |
Determine the ratio of the number of abundant genotypes to the number of rarer genotypes calculated using the ratio of Stoddart & Tayolor index (diversity index weighted for more abundant alleles) & Shannon-Wiener index (diversity index weighted for more rare alleles).
Distribution of evenness per estuary.
Test for significant differences among estuaries using Friedman’s test.
##
## Friedman rank sum test
##
## data: EVENNESS and GRP and LOCUS
## Friedman chi-squared = 3842.3, df = 8, p-value < 0.00000000000000022
Test for significant pairwise differences between estuaries using Wilcoxon signed rank test to identify patterns of significant differences among estuaries.
Level of significance of pairwise comparisons of distribution of allele diversity measured as evenness across loci for individuals samples in a given estuary.
Test statistic, p-value and adjusted p-value of pairwise comparisons of evenness.
| pop1 | pop2 | stat | p.value | p_adj | COMP |
|---|---|---|---|---|---|
| WinyahBay | CharlestonHarbor | 1.35 | 0.17690 | 0.17690 | ATL-ATL |
| CharlestonHarbor | StJohnsRiver | 2.40 | 0.01653 | 0.01700 | ATL-ATL |
| BaratariaBay | ApalachicolaBay | 3.38 | 0.00073 | 0.00078 | GULF-GULF |
| WinyahBay | StJohnsRiver | 3.52 | 0.00042 | 0.00046 | ATL-ATL |
| SanAntonioBay | SabineLake | 4.43 | 0.00001 | 0.00001 | GULF-GULF |
| WestMississippiSound | BaratariaBay | 5.13 | 0.00000 | 0.00000 | GULF-GULF |
| SabineLake | WestMississippiSound | 8.41 | 0.00000 | 0.00000 | GULF-GULF |
| WestMississippiSound | ApalachicolaBay | 8.79 | 0.00000 | 0.00000 | GULF-GULF |
| ApalachicolaBay | MobileBay | 9.53 | 0.00000 | 0.00000 | GULF-GULF |
| BaratariaBay | MobileBay | 12.42 | 0.00000 | 0.00000 | GULF-GULF |
| SanAntonioBay | WestMississippiSound | 12.52 | 0.00000 | 0.00000 | GULF-GULF |
| SabineLake | BaratariaBay | 12.75 | 0.00000 | 0.00000 | GULF-GULF |
| SabineLake | ApalachicolaBay | 15.66 | 0.00000 | 0.00000 | GULF-GULF |
| StJohnsRiver | SanAntonioBay | 17.23 | 0.00000 | 0.00000 | ATL-GULF |
| SanAntonioBay | BaratariaBay | 17.40 | 0.00000 | 0.00000 | GULF-GULF |
| WestMississippiSound | MobileBay | 17.57 | 0.00000 | 0.00000 | GULF-GULF |
| CharlestonHarbor | SanAntonioBay | 19.40 | 0.00000 | 0.00000 | ATL-GULF |
| SanAntonioBay | ApalachicolaBay | 19.89 | 0.00000 | 0.00000 | GULF-GULF |
| WinyahBay | SanAntonioBay | 19.99 | 0.00000 | 0.00000 | ATL-GULF |
| StJohnsRiver | SabineLake | 20.15 | 0.00000 | 0.00000 | ATL-GULF |
| CharlestonHarbor | SabineLake | 21.79 | 0.00000 | 0.00000 | ATL-GULF |
| WinyahBay | SabineLake | 22.65 | 0.00000 | 0.00000 | ATL-GULF |
| SabineLake | MobileBay | 23.81 | 0.00000 | 0.00000 | GULF-GULF |
| StJohnsRiver | WestMississippiSound | 24.84 | 0.00000 | 0.00000 | ATL-GULF |
| CharlestonHarbor | WestMississippiSound | 26.42 | 0.00000 | 0.00000 | ATL-GULF |
| WinyahBay | WestMississippiSound | 26.85 | 0.00000 | 0.00000 | ATL-GULF |
| SanAntonioBay | MobileBay | 26.88 | 0.00000 | 0.00000 | GULF-GULF |
| StJohnsRiver | BaratariaBay | 27.20 | 0.00000 | 0.00000 | ATL-GULF |
| StJohnsRiver | ApalachicolaBay | 28.52 | 0.00000 | 0.00000 | ATL-GULF |
| CharlestonHarbor | BaratariaBay | 28.59 | 0.00000 | 0.00000 | ATL-GULF |
| WinyahBay | BaratariaBay | 29.19 | 0.00000 | 0.00000 | ATL-GULF |
| CharlestonHarbor | ApalachicolaBay | 30.11 | 0.00000 | 0.00000 | ATL-GULF |
| WinyahBay | ApalachicolaBay | 30.54 | 0.00000 | 0.00000 | ATL-GULF |
| StJohnsRiver | MobileBay | 31.82 | 0.00000 | 0.00000 | ATL-GULF |
| CharlestonHarbor | MobileBay | 33.25 | 0.00000 | 0.00000 | ATL-GULF |
| WinyahBay | MobileBay | 33.80 | 0.00000 | 0.00000 | ATL-GULF |
Identify number of fixed loci within sample locations.
Number of fixed alleles in the Atlantic compared to the Gulf.
| GRP | n |
|---|---|
| ATL | 235 |
| GULF | 15 |
Number of fixed alleles per estuary with > 18 individuals based on rarefied allelic richness (estuaries sorted in descending order by number of fixed alleles).
| GRP | n |
|---|---|
| WinyahBay | 513 |
| StJohnsRiver | 454 |
| CharlestonHarbor | 440 |
| SanAntonioBay | 304 |
| SabineLake | 250 |
| WestMississippiSound | 217 |
| BaratariaBay | 173 |
| ApalachicolaBay | 145 |
| MobileBay | 106 |
Compare fixed loci across sampled estuaries in the Gulf and Atlantic. The set size (horizontal green bars) indicates the total number of loci fixed in a given location, the intersection size (vertical orange bars) indicates the number of loci fixed only in a single location (single blue dot) or in two, three, or four locations (indicated by blue dots connected by line).
Compare global allelelic richness per locus for loci fixed in each ocean basin.
Distribution of global allele rarefied allele counts for loci fixed one of the ocean basins.
Compare global allelelic richness per locus for loci that are fixed each sample location.
Distribution of global rarefied allele counts across all individuals for loci fixed in each estuary.
In general, loci fixed in the Gulf are less variable across all individuals compared to loci fixed in the Atlantic with have higher allelic richness in the Gulf.
Tajima’s D, Watterson’s estimator, and nucleotide diversity was calculated for each locus as implemented in pegas (Paradis 2010Paradis, Emmanuel. 2010. “Pegas: An R package for population genetics with an integrated-modular approach.” Oxford University Press. https://doi.org/10.1093/bioinformatics/btp696.).
Get sequence information for loci by reading in reduced representation reference. Create a tidy data set of variant call information (position, alternate alleles) from filtered VCF file used to create haplotypes. Create a data frame of haplotype sequences in the data set(s) using reference contig sequences, VCF information, and codes.out file from rad_haplotyper. Calculate haplotype-based measures, including haplotype diversity, nucleotide diversity, number of segregating sites, and Tajima’s D (test latter for significance).
# Read in the reference sequence
fa <- seqinr::read.fasta("data/REF/reference.fasta")
# create tibble with locus name and sequence
ref <- tibble::tibble(name = seqinr::getName(fa), seq = toupper(unlist(seqinr::getSequence(fa, as.string = TRUE))))
# Read the haplotype VCF file and put it in tidy format
vcf_tidy <- read.vcfR("data/VCF/temp/SFL.recode.vcf") %>%
vcfR2tidy()
# Get the data frame with the positions
pos_tbl <- vcf_tidy$fix %>%
filter(ALT %in% c("A", "T", "C", "G"))
# genind object to use
setPop(gen) <- ~ESTUARY
gen_obj <- gen
# Make a data frame with all of the haplotype sequences from the 'codes.out' file from rad_haplotyper
hap_index <- read_tsv("data/Haplotyping/codes.SFL.gen", col_names = FALSE) %>%
filter(X1 %in% locNames(gen_obj)) %>%
split(.$X1) %>%
purrr::map(function(x) {
codes <- str_split(x$X2, ",") %>%
unlist()
code_tbl <- tibble(locus = x$X1, code = codes) %>%
separate(code, c("hap", "code"), sep = ":") %>%
left_join(ref, by = c("locus" = "name"))
pos <- pos_tbl %>%
filter(CHROM == code_tbl$locus[1]) %>%
pull(POS)
for (i in 1:nrow(code_tbl)) {
alleles <- str_split(code_tbl$hap[i], "") %>%
unlist()
replace <- tibble(pos = pos, allele = alleles)
for (j in 1:nrow(replace)) {
str_sub(code_tbl$seq[i], replace$pos[j], 1) <- replace$allele[j]
}
}
code_tbl
}) %>%
bind_rows()
# set pops
setPop(gen_obj) <- ~ESTUARY
# create tidy data set
gen_tidy <- gen_obj %>%
genind2df(oneColPerAll = TRUE) %>%
rownames_to_column(var = "ind") %>%
gather(allele, code, -pop, -ind) %>%
separate(allele, c("locus", "allele"), sep = "\\.") %>%
filter(pop %in% c("SanAntonioBay", "SabineLake",
"BaratariaBay", "WestMississippiSound", "MobileBay",
"ApalachicolaBay",
"StJohnsRiver", "CharlestonHarbor", "WinyahBay")) %>%
droplevels()
# Calculate haplotype related stats
hap_div_tbl <- locNames(gen_obj) %>%
purrr::map(function(y) {
# for each locus
gen_tidy %>%
filter(locus == y) %>%
filter(!code == "NA") %>%
left_join(hap_index, by = c("code", "locus")) %>%
as.tibble() %>%
arrange(factor(pop, levels = c("SanAntonioBay", "SabineLake",
"BaratariaBay", "WestMississippiSound", "MobileBay",
"ApalachicolaBay",
"StJohnsRiver", "CharlestonHarbor", "WinyahBay"))) %>%
# for each population per locus
split(.$pop) %>%
purrr::map(function(x) {
y <- do.call(rbind, strsplit(x$seq, ""))
rownames(y) <- paste(x$ind, x$allele, sep = ".")
# create DNAbin object for locus
dna_bin <- as.DNAbin(y)
# calculate Tajima's D and p-values
taj_test <- tajima.test(dna_bin)
# create table with results
tibble(locus = x$locus[1],
pop = x$pop[1],
hap_div = hap.div(dna_bin), # calculate haplotype diversity
nuc_div = nuc.div(dna_bin), # calculate nucleotide diversity (pi)
seg_sites = length(seg.sites(dna_bin)), # count number of segregating sites
thetaS = theta.s(x = seg_sites, n = nLoc(gen)), # calculate theta S (watterson 1975)
tajima_d = taj_test$D, # extract Tajima's D
tajima_pval_norm = taj_test$Pval.normal, # extract p-value Tajima's D (normal distribution)
tajima_pval_beta = taj_test$Pval.beta) # extract p-value Tajima's D (beta distribution)
}) %>%
bind_rows()
}) %>%
bind_rows()
write_delim(hap_div_tbl, "results/hap_div_est", delim = "\t")
Genome-wide, null distributions of Tajima’s D were simulated for each estuary, using a coalescent model as executed in MS (Hudson 2002Hudson, Richard R. 2002. “Generating samples under a Wright-Fisher neutral model of genetic variation.” Bioinformatics 18 (2): 337–38. https://doi.org/10.1093/bioinformatics/18.2.337.), to assess whether the observed genome-wide distribution of Tajima’s D values for each estuary was consistent with a neutral, drift-mutation equilibrium. For each simulation a set of neutral loci that consisted of the same number of loci with the same distribution of segregating sites as in the observed data (grouper by estuary) was generated. The difference between empirical and simulated distributions in each estuary was then assessed by generating 1,000 simulated null distributions of Tajima’s D for each estuary and comparing the mean and median values of the empirical distribution with the mean and median of each of the 1,000 simulated distributions. Significance was assessed by estimating the probability that the observed values fall into the distribution of simulated values (means and medians).
Significance (departure from neutrality, assuming genome is in equilibrium) tested per locus in each location, assuming a beta distribution (Tajima 1989Tajima, F. 1989. “Statistical method for testing the neutral mutation hypothesis by DNA polymorphism.” Genetics 123 (3): 585–595. https://doi.org/10.1534/genetics.107.079319.), as implemented in pegas. The number of significant (P < 0.05) loci, positive and negative, were then summarized by estuary. The distribution of loci with significant Tajima’s D values, both positive and negative, across linkage groups was assessed using those loci that previously were incorporated into a linkage map (O’Leary, et al. 2018).
\(\theta\) is the population-scaled mutation rate and can be estimated as \(\hat{\theta}_T\) (Nei 1987Nei, Masatoshi. 1987. Molecular Evolutionary Genetics. https://doi.org/10.7312/nei-92038.) as the number of pairwise differences (Tajima’s estimator or \(\pi\)), while \(\hat{\theta}_W\) is the number of segregating sites (Watterson’s estimator or \(S\), (Watterson 1975Watterson, G. A. 1975. “On the number of segregating sites in genetical models without recombination.” Theoretical Population Biology 7 (2): 256–76. https://doi.org/10.1016/0040-5809(75)90020-9.) . Because \(\hat{\theta}_T\) will underestimate the number of mutations that are rare in the population, Tajima’s \(D\) can be used to test the neutral mutation hypothesis. A negative Tajima’s \(D\) is indicative of positive selection while for balancing selection, alleles are kept at intermediate frequencies resulting in Tajima’s \(D\) becoming positive. To better understand what was driving patterns in neutrality tests , Θ was calculated based on the number of segregating sites as \(\hat{\theta}_W\) and based on pairwise differences among haplotypes (nucleotide diversity), and the mean and standard deviation compared across estuaries.
Identify null distribution of Tajima’s \(D\) across a genome for set of neutral loci (same number of loci with same distribution of segregating sites as empirical data set) using loci simuated under coalescence using MS (Hudson 2002Hudson, Richard R. 2002. “Generating samples under a Wright-Fisher neutral model of genetic variation.” Bioinformatics 18 (2): 337–38. https://doi.org/10.1093/bioinformatics/18.2.337.).
SanAntonioBay
Simulate 1000 independent replicates (loci) for 1 - 36 segregating sites under neutral model for 23 individuals (46 observations) for no population expansion and for population expansion for alpha = 10, 30, 90.
Population size is given by: N(t) = N0 exp(alpha*t), for t = time before the present, measured in units of 4N0 generations.
# simulated neutral loci
for i in {1..36}
do
scr/MS/ms 46 1000 -s $i | scr/MS/sample_stats > results/Ind23_Sites${i}_a0_Rep1000.stats
done
# simulated neutral loci with population expansion for alpha
for a in 5 10 30 90
do
for i in {1..36}
do
scr/MS/ms 46 1000 -s $i -G $a | scr/MS/sample_stats > results/Ind23_Sites${i}_a${a}_Rep1000.stats
done
done
Create simulated data set based on empirical data set, i.e. determine number of loci with 1 … x segregating sites, then randomly draw that number of loci from simulated data set for each class of segregating loci for neutral loci w/no population growth.
# create a vector of stats files based on simulations
filenames <- list.files(path = "results/", pattern = "Ind23")
# create an empty list that will serve as a container to receive the incoming files
sim <-list()
# create a loop to read in your data
for (i in 1:length(filenames)){
sim[[i]] <- read_delim(file.path("results", filenames[i]), delim = "\t",
col_names = c("t1", "pi",
"t2", "seg_sites",
"t3", "tajima_d",
"t4", "thetaH",
"t5", "H")) %>%
select(pi, seg_sites, tajima_d, thetaH, H)
}
# add names
names(sim) <- filenames
# combine into dataframe
sim <- ldply(sim, data.frame) %>%
rename(simulation = `.id`) %>%
separate(simulation, into = c("temp", "t1"), sep = -6) %>%
separate(temp, into = c("N_IND", "t2", "ALPHA", "N_REPS"), sep = "_") %>%
mutate(N_REPS = as.numeric(str_extract(N_REPS, "[[:digit:]]+")),
N_IND = as.numeric(str_extract(N_IND, "[[:digit:]]+")),
ALPHA = as.numeric(str_extract(ALPHA, "[[:digit:]]+"))) %>%
select(-t1, -t2)
# set number of cores to run in parallel
registerDoMC(50)
reps <- 1000
# group to be compared
grp <- "SanAntonioBay"
# population growth (alpha)
alpha <- c(0, 5, 10, 30, 90)
# get counts for number of segregating sites
emp <- read_delim("results/hap_div_est", delim = "\t") %>%
select(locus, pop, seg_sites) %>%
filter(!seg_sites == 0 & seg_sites < 35) %>%
filter(pop == grp) %>%
count(seg_sites)
null_distributions <- list()
for(a in alpha) {
# replicate draws to match distribution in parallel
results <- foreach(i = 1:reps) %dopar% {
sim_data <- list()
for (s in emp$seg_sites) {
# determine number of loci with s segregating sites
t <- emp %>%
filter(seg_sites == s)
n_loci <- t$n
# sample n_loci simulated loci (with replacement)
sim_data[[s]] <- sim %>%
filter(ALPHA == a) %>%
filter(seg_sites == s) %>%
sample_n(size = n_loci, replace = TRUE) %>%
mutate(data_set = paste("sim_", as.character(i), sep = ""))
}
sim_data <- ldply(sim_data, data.frame) %>%
select(seg_sites, ALPHA, tajima_d, data_set)
}
null_distributions[[as.character(a)]] <- ldply(results, data.frame)
}
null_distributions <- ldply(null_distributions, data.frame) %>%
select(-`.id`)
write_delim(null_distributions, "results/SA.nulldist", delim = "\t")
Compare empirical and simulated data sets:
Cumulative distribution of Tajima’s D for empirical and simulated data sets for a range of population expansion values.
Estimate p-values to determine if parameters describing the distribution of Tajima’s D across the genome is significantly different (smaller/larger) than simulated null distributions.
param <- taj %>%
group_by(ALPHA, data_set) %>%
summarise(mean = mean(tajima_d),
std = sd(tajima_d),
SE = std.error(tajima_d),
quantile_10 = quantile(tajima_d, 0.1, na.rm=TRUE),
quantile_25 = quantile(tajima_d, 0.25, na.rm=TRUE),
median = median(tajima_d),
quantile_75 = quantile(tajima_d, 0.75, na.rm=TRUE),
quantile_90 = quantile(tajima_d, 0.90, na.rm=TRUE)) %>%
gather(key = PARAM, value = VALUE, 3:10) %>%
ungroup()
# number of permutations
nperm <- 1000
# parameters to assess p-value for
stats <- unique(param$PARAM)
# values for alpha (population growth parameter)
alpha <- c(0, 5, 10, 30, 90)
# dataframe for results
sign <- setNames(data.frame(matrix(ncol = 6, nrow = 0)),
c("ALPHA", "PARAM", "OBS", "SIM", "PVALsmaller", "PVALlarger")) %>%
mutate(ALPHA = as.character(ALPHA),
PARAM = as.character(PARAM),
OBS = as.numeric(OBS),
SIM = as.numeric(SIM),
PVALsmaller = as.numeric(PVALsmaller),
PVALlarger = as.numeric(PVALlarger))
for(p in stats) {
for(a in alpha) {
# observed parameters
obs <- param %>%
filter(data_set == "empirical",
PARAM == p)
# simulated parameters
sim <- param %>%
filter(ALPHA == a,
PARAM == p)
# count number of times simulated value < observed value
obs_value <- obs$VALUE
smaller <- sim %>%
filter(VALUE < obs_value) %>%
nrow()
# count number of times simulated value > observed value
obs_value <- obs$VALUE
larger <- sim %>%
filter(VALUE > obs_value) %>%
nrow()
temp <- obs %>%
select(PARAM, VALUE) %>%
rename(OBS = VALUE) %>%
mutate(SIM = mean(sim$VALUE),
PVALsmaller = smaller/nperm,
PVALlarger = larger/nperm,
ALPHA = as.character(a))
sign <- bind_rows(sign, temp)
}
}
kable(
sign %>%
arrange(ALPHA),
caption = "Summary statistics of the distribution of Tajima's D."
)
Summary statistics of the distribution of Tajima’s D.
| ALPHA | PARAM | OBS | SIM | PVALsmaller | PVALlarger |
|---|---|---|---|---|---|
| 0 | mean | -0.6312303 | -0.0567930 | 0.000 | 1.000 |
| 0 | std | 0.8072930 | 0.9690807 | 0.000 | 1.000 |
| 0 | SE | 0.0130771 | 0.0156979 | 0.000 | 1.000 |
| 0 | quantile_10 | -1.5194315 | -1.1977150 | 0.000 | 1.000 |
| 0 | quantile_25 | -1.1840145 | -0.8467111 | 0.000 | 1.000 |
| 0 | median | -0.8235974 | -0.1778511 | 0.000 | 1.000 |
| 0 | quantile_75 | -0.1778705 | 0.5947352 | 0.000 | 1.000 |
| 0 | quantile_90 | 0.4846081 | 1.3592651 | 0.000 | 1.000 |
| 10 | mean | -0.6312303 | -0.6871890 | 1.000 | 0.000 |
| 10 | std | 0.8072930 | 0.7655427 | 1.000 | 0.000 |
| 10 | SE | 0.0130771 | 0.0124008 | 1.000 | 0.000 |
| 10 | quantile_10 | -1.5194315 | -1.5232487 | 0.523 | 0.477 |
| 10 | quantile_25 | -1.1840145 | -1.2173933 | 0.989 | 0.011 |
| 10 | median | -0.8235974 | -0.8523805 | 0.992 | 0.008 |
| 10 | quantile_75 | -0.1778705 | -0.2593794 | 1.000 | 0.000 |
| 10 | quantile_90 | 0.4846081 | 0.3760594 | 1.000 | 0.000 |
| 30 | mean | -0.6312303 | -0.9780005 | 1.000 | 0.000 |
| 30 | std | 0.8072930 | 0.6730369 | 1.000 | 0.000 |
| 30 | SE | 0.0130771 | 0.0109023 | 1.000 | 0.000 |
| 30 | quantile_10 | -1.5194315 | -1.7052337 | 1.000 | 0.000 |
| 30 | quantile_25 | -1.1840145 | -1.4638000 | 1.000 | 0.000 |
| 30 | median | -0.8235974 | -1.1110330 | 1.000 | 0.000 |
| 30 | quantile_75 | -0.1778705 | -0.6352838 | 1.000 | 0.000 |
| 30 | quantile_90 | 0.4846081 | -0.0757024 | 1.000 | 0.000 |
| 5 | mean | -0.6312303 | -0.5342840 | 0.000 | 1.000 |
| 5 | std | 0.8072930 | 0.8039724 | 0.623 | 0.377 |
| 5 | SE | 0.0130771 | 0.0130233 | 0.623 | 0.377 |
| 5 | quantile_10 | -1.5194315 | -1.4612264 | 0.000 | 1.000 |
| 5 | quantile_25 | -1.1840145 | -1.1114426 | 0.000 | 1.000 |
| 5 | median | -0.8235974 | -0.6728194 | 0.000 | 1.000 |
| 5 | quantile_75 | -0.1778705 | -0.0491834 | 0.000 | 1.000 |
| 5 | quantile_90 | 0.4846081 | 0.5833725 | 0.001 | 0.999 |
| 90 | mean | -0.6312303 | -1.1783319 | 1.000 | 0.000 |
| 90 | std | 0.8072930 | 0.6312627 | 1.000 | 0.000 |
| 90 | SE | 0.0130771 | 0.0102256 | 1.000 | 0.000 |
| 90 | quantile_10 | -1.5194315 | -1.8533453 | 1.000 | 0.000 |
| 90 | quantile_25 | -1.1840145 | -1.5840429 | 1.000 | 0.000 |
| 90 | median | -0.8235974 | -1.2972156 | 1.000 | 0.000 |
| 90 | quantile_75 | -0.1778705 | -0.8968364 | 1.000 | 0.000 |
| 90 | quantile_90 | 0.4846081 | -0.3890550 | 1.000 | 0.000 |
Sabine Lake
Simulate 1000 independent replicates (loci) for 1 - 36 segregating sites under neutral model for 24 individuals (48 observations) for no population expansion and for population expansion for alpha = 5, 10, 30, 90.
# simulated neutral loci
for i in {1..36}
do
scr/MS/ms 48 1000 -s $i | scr/MS/sample_stats > results/Ind24_Sites${i}_a0_Rep1000.stats
done
# simulated neutral loci with population expansion for alpha
for a in 5 10 30 90
do
for i in {1..36}
do
scr/MS/ms 48 1000 -s $i -G $a | scr/MS/sample_stats > results/Ind24_Sites${i}_a${a}_Rep1000.stats
done
done
Import results to compare null distribution of Tajima’s D for each locus according to number of segregating sites.
Create simulated data set based on empirical data set, i.e. determine number of loci with 1 … x segregating sites, then randomly draw that number of loci from simulated data set for each class of segregating loci for neutral loci w/no population growth.
Cumulative distribution of Tajima’s D for empirical and simulated data sets for a range of population expansion values.
Compare distributions of empirical and simulated data sets to determine if parameters describing the empirical distributions is significantly larger or smaller than the simulated null distribution.
Summary statistics of the distribution of Tajima’s D.
| ALPHA | PARAM | OBS | SIM | PVALsmaller | PVALlarger |
|---|---|---|---|---|---|
| 0 | mean | -0.6889841 | -0.0785965 | 0.000 | 1.000 |
| 0 | std | 0.7920153 | 0.9618471 | 0.000 | 1.000 |
| 0 | SE | 0.0127331 | 0.0154635 | 0.000 | 1.000 |
| 0 | quantile_10 | -1.5665469 | -1.2253161 | 0.000 | 1.000 |
| 0 | quantile_25 | -1.2405739 | -0.8631777 | 0.000 | 1.000 |
| 0 | median | -0.8621855 | -0.1849718 | 0.000 | 1.000 |
| 0 | quantile_75 | -0.2525410 | 0.5691393 | 0.000 | 1.000 |
| 0 | quantile_90 | 0.3927034 | 1.2872139 | 0.000 | 1.000 |
| 10 | mean | -0.6889841 | -0.7172501 | 0.991 | 0.009 |
| 10 | std | 0.7920153 | 0.7552064 | 1.000 | 0.000 |
| 10 | SE | 0.0127331 | 0.0121413 | 1.000 | 0.000 |
| 10 | quantile_10 | -1.5665469 | -1.5605983 | 0.559 | 0.441 |
| 10 | quantile_25 | -1.2405739 | -1.2537852 | 0.802 | 0.198 |
| 10 | median | -0.8621855 | -0.8672021 | 0.991 | 0.009 |
| 10 | quantile_75 | -0.2525410 | -0.2926859 | 0.972 | 0.028 |
| 10 | quantile_90 | 0.3927034 | 0.3464750 | 0.983 | 0.017 |
| 30 | mean | -0.6889841 | -0.9844898 | 1.000 | 0.000 |
| 30 | std | 0.7920153 | 0.6741268 | 1.000 | 0.000 |
| 30 | SE | 0.0127331 | 0.0108378 | 1.000 | 0.000 |
| 30 | quantile_10 | -1.5665469 | -1.7161487 | 1.000 | 0.000 |
| 30 | quantile_25 | -1.2405739 | -1.4668954 | 1.000 | 0.000 |
| 30 | median | -0.8621855 | -1.1068650 | 1.000 | 0.000 |
| 30 | quantile_75 | -0.2525410 | -0.6472442 | 1.000 | 0.000 |
| 30 | quantile_90 | 0.3927034 | -0.0792495 | 1.000 | 0.000 |
| 5 | mean | -0.6889841 | -0.5435107 | 0.000 | 1.000 |
| 5 | std | 0.7920153 | 0.7996179 | 0.206 | 0.794 |
| 5 | SE | 0.0127331 | 0.0128553 | 0.206 | 0.794 |
| 5 | quantile_10 | -1.5665469 | -1.4670794 | 0.000 | 1.000 |
| 5 | quantile_25 | -1.2405739 | -1.1129919 | 0.000 | 1.000 |
| 5 | median | -0.8621855 | -0.6719187 | 0.000 | 1.000 |
| 5 | quantile_75 | -0.2525410 | -0.0418688 | 0.000 | 1.000 |
| 5 | quantile_90 | 0.3927034 | 0.5586672 | 0.000 | 1.000 |
| 90 | mean | -0.6889841 | -1.2001224 | 1.000 | 0.000 |
| 90 | std | 0.7920153 | 0.6137250 | 1.000 | 0.000 |
| 90 | SE | 0.0127331 | 0.0098668 | 1.000 | 0.000 |
| 90 | quantile_10 | -1.5665469 | -1.8655387 | 1.000 | 0.000 |
| 90 | quantile_25 | -1.2405739 | -1.6084470 | 1.000 | 0.000 |
| 90 | median | -0.8621855 | -1.3063056 | 1.000 | 0.000 |
| 90 | quantile_75 | -0.2525410 | -0.9370921 | 1.000 | 0.000 |
| 90 | quantile_90 | 0.3927034 | -0.4157354 | 1.000 | 0.000 |
Barataria Bay
Simulate 1000 independent replicates (loci) for 1 - 36 segregating sites under neutral model for 39 individuals (78 observations) for no population expansion and for population expansion for alpha = 5, 10, 30, 90.
# simulated neutral loci
for i in {1..36}
do
scr/MS/ms 78 1000 -s $i | scr/MS/sample_stats > results/Ind39_Sites${i}_a0_Rep1000.stats
done
# simulated neutral loci with population expansion for alpha
for a in 5 10 30 90
do
for i in {1..36}
do
scr/MS/ms 78 1000 -s $i -G $a | scr/MS/sample_stats > results/Ind39_Sites${i}_a${a}_Rep1000.stats
done
done
Import results to compare null distribution of Tajima’s D for each locus according to number of segregating sites.
Create simulated data set based on empirical data set, i.e. determine number of loci with 1 … x segregating sites, then randomly draw that number of loci from simulated data set for each class of segregating loci for neutral loci w/no population growth.
Cumulative distribution of Tajima’s D for empirical and simulated data sets for a range of population expansion values.
Compare distributions of empirical and simulated data sets to determine if parameters describing the empirical distributions is significantly larger or smaller than the simulated null distribution.
Summary statistics of the distribution of Tajima’s D.
| ALPHA | PARAM | OBS | SIM | PVALsmaller | PVALlarger |
|---|---|---|---|---|---|
| 0 | mean | -0.6707489 | -0.0636089 | 0.000 | 1.000 |
| 0 | std | 0.7885400 | 0.9522225 | 0.000 | 1.000 |
| 0 | SE | 0.0125577 | 0.0151644 | 0.000 | 1.000 |
| 0 | quantile_10 | -1.5262175 | -1.1930471 | 0.000 | 1.000 |
| 0 | quantile_25 | -1.2360644 | -0.8211557 | 0.000 | 1.000 |
| 0 | median | -0.8441499 | -0.1800690 | 0.000 | 1.000 |
| 0 | quantile_75 | -0.2325124 | 0.5698343 | 0.000 | 1.000 |
| 0 | quantile_90 | 0.3861367 | 1.3179328 | 0.000 | 1.000 |
| 10 | mean | -0.6707489 | -0.7776915 | 1.000 | 0.000 |
| 10 | std | 0.7885400 | 0.7196294 | 1.000 | 0.000 |
| 10 | SE | 0.0125577 | 0.0114603 | 1.000 | 0.000 |
| 10 | quantile_10 | -1.5262175 | -1.5674669 | 0.999 | 0.001 |
| 10 | quantile_25 | -1.2360644 | -1.2925806 | 0.999 | 0.001 |
| 10 | median | -0.8441499 | -0.9201910 | 1.000 | 0.000 |
| 10 | quantile_75 | -0.2325124 | -0.3926858 | 1.000 | 0.000 |
| 10 | quantile_90 | 0.3861367 | 0.2190113 | 1.000 | 0.000 |
| 30 | mean | -0.6707489 | -1.0267308 | 1.000 | 0.000 |
| 30 | std | 0.7885400 | 0.6473541 | 1.000 | 0.000 |
| 30 | SE | 0.0125577 | 0.0103093 | 1.000 | 0.000 |
| 30 | quantile_10 | -1.5262175 | -1.7253802 | 1.000 | 0.000 |
| 30 | quantile_25 | -1.2360644 | -1.4641894 | 1.000 | 0.000 |
| 30 | median | -0.8441499 | -1.1077578 | 1.000 | 0.000 |
| 30 | quantile_75 | -0.2325124 | -0.7477087 | 1.000 | 0.000 |
| 30 | quantile_90 | 0.3861367 | -0.1837284 | 1.000 | 0.000 |
| 5 | mean | -0.6707489 | -0.5837980 | 0.000 | 1.000 |
| 5 | std | 0.7885400 | 0.7795619 | 0.805 | 0.195 |
| 5 | SE | 0.0125577 | 0.0124147 | 0.805 | 0.195 |
| 5 | quantile_10 | -1.5262175 | -1.4374738 | 0.000 | 1.000 |
| 5 | quantile_25 | -1.2360644 | -1.1272069 | 0.000 | 1.000 |
| 5 | median | -0.8441499 | -0.7477897 | 0.000 | 1.000 |
| 5 | quantile_75 | -0.2325124 | -0.1516918 | 0.000 | 1.000 |
| 5 | quantile_90 | 0.3861367 | 0.5134716 | 0.000 | 1.000 |
| 90 | mean | -0.6707489 | -1.2524127 | 1.000 | 0.000 |
| 90 | std | 0.7885400 | 0.5745890 | 1.000 | 0.000 |
| 90 | SE | 0.0125577 | 0.0091505 | 1.000 | 0.000 |
| 90 | quantile_10 | -1.5262175 | -1.8825303 | 1.000 | 0.000 |
| 90 | quantile_25 | -1.2360644 | -1.6479069 | 1.000 | 0.000 |
| 90 | median | -0.8441499 | -1.3320628 | 1.000 | 0.000 |
| 90 | quantile_75 | -0.2325124 | -1.0206349 | 1.000 | 0.000 |
| 90 | quantile_90 | 0.3861367 | -0.5704469 | 1.000 | 0.000 |
West Mississippi Sound
Simulate 1000 independent replicates (loci) for 1 - 36 segregating sites under neutral model for 34 individuals (68 observations) for no population expansion and for population expansion for alpha = 5, 10, 30, 90.
# simulated neutral loci
for i in {1..36}
do
scr/MS/ms 68 1000 -s $i | scr/MS/sample_stats > results/Ind34_Sites${i}_a0_Rep1000.stats
done
# simulated neutral loci with population expansion for alpha
for a in 5 10 30 90
do
for i in {1..36}
do
scr/MS/ms 68 1000 -s $i -G $a | scr/MS/sample_stats > results/Ind34_Sites${i}_a${a}_Rep1000.stats
done
done
Import results to compare null distribution of Tajima’s D for each locus according to number of segregating sites.
Create simulated data set based on empirical data set, i.e. determine number of loci with 1 … x segregating sites, then randomly draw that number of loci from simulated data set for each class of segregating loci for neutral loci w/no population growth.
Cumulative distribution of Tajima’s D for empirical and simulated data sets for a range of population expansion values.
Compare distributions of empirical and simulated data sets to determine if parameters describing the empirical distributions is significantly larger or smaller than the simulated null distribution.
Summary statistics of the distribution of Tajima’s D.
| ALPHA | PARAM | OBS | SIM | PVALsmaller | PVALlarger |
|---|---|---|---|---|---|
| 0 | mean | -0.6647244 | -0.0678521 | 0.000 | 1.000 |
| 0 | std | 0.7930950 | 0.9555513 | 0.000 | 1.000 |
| 0 | SE | 0.0127176 | 0.0153227 | 0.000 | 1.000 |
| 0 | quantile_10 | -1.5285687 | -1.1962915 | 0.000 | 1.000 |
| 0 | quantile_25 | -1.2234043 | -0.8215735 | 0.000 | 1.000 |
| 0 | median | -0.8347845 | -0.1906789 | 0.000 | 1.000 |
| 0 | quantile_75 | -0.2189918 | 0.5547455 | 0.000 | 1.000 |
| 0 | quantile_90 | 0.4300502 | 1.3257402 | 0.000 | 1.000 |
| 10 | mean | -0.6647244 | -0.7509188 | 1.000 | 0.000 |
| 10 | std | 0.7930950 | 0.7298519 | 1.000 | 0.000 |
| 10 | SE | 0.0127176 | 0.0117035 | 1.000 | 0.000 |
| 10 | quantile_10 | -1.5285687 | -1.5552237 | 0.994 | 0.006 |
| 10 | quantile_25 | -1.2234043 | -1.2737350 | 0.998 | 0.002 |
| 10 | median | -0.8347845 | -0.9001907 | 1.000 | 0.000 |
| 10 | quantile_75 | -0.2189918 | -0.3570284 | 1.000 | 0.000 |
| 10 | quantile_90 | 0.4300502 | 0.2557269 | 1.000 | 0.000 |
| 30 | mean | -0.6647244 | -1.0328754 | 1.000 | 0.000 |
| 30 | std | 0.7930950 | 0.6477537 | 1.000 | 0.000 |
| 30 | SE | 0.0127176 | 0.0103870 | 1.000 | 0.000 |
| 30 | quantile_10 | -1.5285687 | -1.7289601 | 1.000 | 0.000 |
| 30 | quantile_25 | -1.2234043 | -1.4757960 | 1.000 | 0.000 |
| 30 | median | -0.8347845 | -1.1076278 | 1.000 | 0.000 |
| 30 | quantile_75 | -0.2189918 | -0.7367943 | 1.000 | 0.000 |
| 30 | quantile_90 | 0.4300502 | -0.1739533 | 1.000 | 0.000 |
| 5 | mean | -0.6647244 | -0.5836971 | 0.000 | 1.000 |
| 5 | std | 0.7930950 | 0.7829934 | 0.853 | 0.147 |
| 5 | SE | 0.0127176 | 0.0125556 | 0.853 | 0.147 |
| 5 | quantile_10 | -1.5285687 | -1.4581533 | 0.000 | 1.000 |
| 5 | quantile_25 | -1.2234043 | -1.1342969 | 0.000 | 1.000 |
| 5 | median | -0.8347845 | -0.7362509 | 0.000 | 1.000 |
| 5 | quantile_75 | -0.2189918 | -0.1284639 | 0.000 | 1.000 |
| 5 | quantile_90 | 0.4300502 | 0.5129576 | 0.000 | 1.000 |
| 90 | mean | -0.6647244 | -1.2437423 | 1.000 | 0.000 |
| 90 | std | 0.7930950 | 0.5780765 | 1.000 | 0.000 |
| 90 | SE | 0.0127176 | 0.0092697 | 1.000 | 0.000 |
| 90 | quantile_10 | -1.5285687 | -1.8759689 | 1.000 | 0.000 |
| 90 | quantile_25 | -1.2234043 | -1.6546031 | 1.000 | 0.000 |
| 90 | median | -0.8347845 | -1.3208527 | 1.000 | 0.000 |
| 90 | quantile_75 | -0.2189918 | -1.0033864 | 1.000 | 0.000 |
| 90 | quantile_90 | 0.4300502 | -0.5375963 | 1.000 | 0.000 |
Mobile Bay
Simulate 1000 independent replicates (loci) for 1 - 36 segregating sites under neutral model for 62 individuals (48 observations) for no population expansion and for population expansion for alpha = 5, 10, 30, 90.
# simulated neutral loci
for i in {1..36}
do
scr/MS/ms 124 1000 -s $i | scr/MS/sample_stats > results/Ind62_Sites${i}_a0_Rep1000.stats
done
# simulated neutral loci with population expansion for alpha
for a in 5 10 30 90
do
for i in {1..36}
do
scr/MS/ms 124 1000 -s $i -G $a | scr/MS/sample_stats > results/Ind62_Sites${i}_a${a}_Rep1000.stats
done
done
Import results to compare null distribution of Tajima’s D for each locus according to number of segregating sites.
Create simulated data set based on empirical data set, i.e. determine number of loci with 1 … x segregating sites, then randomly draw that number of loci from simulated data set for each class of segregating loci for neutral loci w/no population growth.
Cumulative distribution of Tajima’s D for empirical and simulated data sets for a range of population expansion values.
Compare distributions of empirical and simulated data sets to determine if parameters describing the empirical distributions is significantly larger or smaller than the simulated null distribution.
Summary statistics of the distribution of Tajima’s D.
| ALPHA | PARAM | OBS | SIM | PVALsmaller | PVALlarger |
|---|---|---|---|---|---|
| 0 | mean | -0.6717212 | -0.0689478 | 0.000 | 1.000 |
| 0 | std | 0.7758586 | 0.9639020 | 0.000 | 1.000 |
| 0 | SE | 0.0122720 | 0.0152463 | 0.000 | 1.000 |
| 0 | quantile_10 | -1.5051528 | -1.2088858 | 0.000 | 1.000 |
| 0 | quantile_25 | -1.2227716 | -0.8319931 | 0.000 | 1.000 |
| 0 | median | -0.8432089 | -0.1899281 | 0.000 | 1.000 |
| 0 | quantile_75 | -0.2830638 | 0.5427770 | 0.000 | 1.000 |
| 0 | quantile_90 | 0.3850940 | 1.3022929 | 0.000 | 1.000 |
| 10 | mean | -0.6717212 | -0.8015301 | 1.000 | 0.000 |
| 10 | std | 0.7758586 | 0.6954500 | 1.000 | 0.000 |
| 10 | SE | 0.0122720 | 0.0110002 | 1.000 | 0.000 |
| 10 | quantile_10 | -1.5051528 | -1.5745937 | 1.000 | 0.000 |
| 10 | quantile_25 | -1.2227716 | -1.2916222 | 1.000 | 0.000 |
| 10 | median | -0.8432089 | -0.9235151 | 1.000 | 0.000 |
| 10 | quantile_75 | -0.2830638 | -0.4427216 | 1.000 | 0.000 |
| 10 | quantile_90 | 0.3850940 | 0.1487216 | 1.000 | 0.000 |
| 30 | mean | -0.6717212 | -1.0753747 | 1.000 | 0.000 |
| 30 | std | 0.7758586 | 0.6188381 | 1.000 | 0.000 |
| 30 | SE | 0.0122720 | 0.0097884 | 1.000 | 0.000 |
| 30 | quantile_10 | -1.5051528 | -1.7487884 | 1.000 | 0.000 |
| 30 | quantile_25 | -1.2227716 | -1.4994872 | 1.000 | 0.000 |
| 30 | median | -0.8432089 | -1.1673112 | 1.000 | 0.000 |
| 30 | quantile_75 | -0.2830638 | -0.8075808 | 1.000 | 0.000 |
| 30 | quantile_90 | 0.3850940 | -0.2927997 | 1.000 | 0.000 |
| 5 | mean | -0.6717212 | -0.6224272 | 0.000 | 1.000 |
| 5 | std | 0.7758586 | 0.7627542 | 0.896 | 0.104 |
| 5 | SE | 0.0122720 | 0.0120647 | 0.896 | 0.104 |
| 5 | quantile_10 | -1.5051528 | -1.4563755 | 0.000 | 1.000 |
| 5 | quantile_25 | -1.2227716 | -1.1662361 | 0.000 | 1.000 |
| 5 | median | -0.8432089 | -0.7811226 | 0.000 | 1.000 |
| 5 | quantile_75 | -0.2830638 | -0.2070992 | 0.000 | 1.000 |
| 5 | quantile_90 | 0.3850940 | 0.4422204 | 0.042 | 0.958 |
| 90 | mean | -0.6717212 | -1.2901302 | 1.000 | 0.000 |
| 90 | std | 0.7758586 | 0.5608594 | 1.000 | 0.000 |
| 90 | SE | 0.0122720 | 0.0088713 | 1.000 | 0.000 |
| 90 | quantile_10 | -1.5051528 | -1.8965090 | 1.000 | 0.000 |
| 90 | quantile_25 | -1.2227716 | -1.6774317 | 1.000 | 0.000 |
| 90 | median | -0.8432089 | -1.3533690 | 1.000 | 0.000 |
| 90 | quantile_75 | -0.2830638 | -1.0039278 | 1.000 | 0.000 |
| 90 | quantile_90 | 0.3850940 | -0.6480351 | 1.000 | 0.000 |
Apalachicola Bay
Simulate 1000 independent replicates (loci) for 1 - 36 segregating sites under neutral model for 44 individuals (48 observations) for no population expansion and for population expansion for alpha = 5, 10, 30, 90.
# simulated neutral loci
for i in {1..36}
do
scr/MS/ms 88 1000 -s $i | scr/MS/sample_stats > results/Ind44_Sites${i}_a0_Rep1000.stats
done
# simulated neutral loci with population expansion for alpha
for a in 5 10 30 90
do
for i in {1..36}
do
scr/MS/ms 88 1000 -s $i -G $a | scr/MS/sample_stats > results/Ind44_Sites${i}_a${a}_Rep1000.stats
done
done
Import results to compare null distribution of Tajima’s D for each locus according to number of segregating sites.
Create simulated data set based on empirical data set, i.e. determine number of loci with 1 … x segregating sites, then randomly draw that number of loci from simulated data set for each class of segregating loci for neutral loci w/no population growth.
Cumulative distribution of Tajima’s D for empirical and simulated data sets for a range of population expansion values.
Compare distributions of empirical and simulated data sets to determine if parameters describing the empirical distributions is significantly larger or smaller than the simulated null distribution.
Summary statistics of the distribution of Tajima’s D.
| ALPHA | PARAM | OBS | SIM | PVALsmaller | PVALlarger |
|---|---|---|---|---|---|
| 0 | mean | -0.6661067 | -0.0896773 | 0.000 | 1.000 |
| 0 | std | 0.7805748 | 0.9503664 | 0.000 | 1.000 |
| 0 | SE | 0.0123995 | 0.0150966 | 0.000 | 1.000 |
| 0 | quantile_10 | -1.4907011 | -1.2059490 | 0.000 | 1.000 |
| 0 | quantile_25 | -1.2205584 | -0.8521527 | 0.000 | 1.000 |
| 0 | median | -0.8291125 | -0.1987122 | 0.000 | 1.000 |
| 0 | quantile_75 | -0.2722850 | 0.5167723 | 0.000 | 1.000 |
| 0 | quantile_90 | 0.4035077 | 1.2789622 | 0.000 | 1.000 |
| 10 | mean | -0.6661067 | -0.7805077 | 1.000 | 0.000 |
| 10 | std | 0.7805748 | 0.7153168 | 1.000 | 0.000 |
| 10 | SE | 0.0123995 | 0.0113628 | 1.000 | 0.000 |
| 10 | quantile_10 | -1.4907011 | -1.5656684 | 1.000 | 0.000 |
| 10 | quantile_25 | -1.2205584 | -1.2950439 | 1.000 | 0.000 |
| 10 | median | -0.8291125 | -0.9165530 | 1.000 | 0.000 |
| 10 | quantile_75 | -0.2722850 | -0.4092971 | 1.000 | 0.000 |
| 10 | quantile_90 | 0.4035077 | 0.2072104 | 1.000 | 0.000 |
| 30 | mean | -0.6661067 | -1.0386845 | 1.000 | 0.000 |
| 30 | std | 0.7805748 | 0.6376537 | 1.000 | 0.000 |
| 30 | SE | 0.0123995 | 0.0101291 | 1.000 | 0.000 |
| 30 | quantile_10 | -1.4907011 | -1.7392293 | 1.000 | 0.000 |
| 30 | quantile_25 | -1.2205584 | -1.4780679 | 1.000 | 0.000 |
| 30 | median | -0.8291125 | -1.1145536 | 1.000 | 0.000 |
| 30 | quantile_75 | -0.2722850 | -0.7612880 | 1.000 | 0.000 |
| 30 | quantile_90 | 0.4035077 | -0.2119599 | 1.000 | 0.000 |
| 5 | mean | -0.6661067 | -0.5990917 | 0.000 | 1.000 |
| 5 | std | 0.7805748 | 0.7831740 | 0.404 | 0.596 |
| 5 | SE | 0.0123995 | 0.0124407 | 0.404 | 0.596 |
| 5 | quantile_10 | -1.4907011 | -1.4651840 | 0.012 | 0.988 |
| 5 | quantile_25 | -1.2205584 | -1.1485910 | 0.000 | 1.000 |
| 5 | median | -0.8291125 | -0.7650752 | 0.000 | 1.000 |
| 5 | quantile_75 | -0.2722850 | -0.1606926 | 0.000 | 1.000 |
| 5 | quantile_90 | 0.4035077 | 0.4906106 | 0.002 | 0.998 |
| 90 | mean | -0.6661067 | -1.2487284 | 1.000 | 0.000 |
| 90 | std | 0.7805748 | 0.5775171 | 1.000 | 0.000 |
| 90 | SE | 0.0123995 | 0.0091739 | 1.000 | 0.000 |
| 90 | quantile_10 | -1.4907011 | -1.8834868 | 1.000 | 0.000 |
| 90 | quantile_25 | -1.2205584 | -1.6445014 | 1.000 | 0.000 |
| 90 | median | -0.8291125 | -1.3337679 | 1.000 | 0.000 |
| 90 | quantile_75 | -0.2722850 | -0.9981104 | 1.000 | 0.000 |
| 90 | quantile_90 | 0.4035077 | -0.5426452 | 1.000 | 0.000 |
St. Johns River
Simulate 1000 independent replicates (loci) for 1 - 36 segregating sites under neutral model for 20 individuals (40 observations) for no population expansion and for population expansion for alpha = 5, 10, 30, 90.
# simulated neutral loci
for i in {1..36}
do
scr/MS/ms 40 1000 -s $i | scr/MS/sample_stats > results/Ind20_Sites${i}_a0_Rep1000.stats
done
# simulated neutral loci with population expansion for alpha
for a in 5 10 30 90
do
for i in {1..36}
do
scr/MS/ms 40 1000 -s $i -G $a | scr/MS/sample_stats > results/Ind20_Sites${i}_a${a}_Rep1000.stats
done
done
Import results to compare null distribution of Tajima’s D for each locus according to number of segregating sites.
Create simulated data set based on empirical data set, i.e. determine number of loci with 1 … x segregating sites, then randomly draw that number of loci from simulated data set for each class of segregating loci for neutral loci w/no population growth.
Cumulative distribution of Tajima’s D for empirical and simulated data sets for a range of population expansion values.
Compare distributions of empirical and simulated data sets to determine if parameters describing the empirical distributions is significantly larger or smaller than the simulated null distribution.
Summary statistics of the distribution of Tajima’s D.
| ALPHA | PARAM | OBS | SIM | PVALsmaller | PVALlarger |
|---|---|---|---|---|---|
| 0 | mean | -0.3462437 | -0.0439215 | 0.000 | 1.000 |
| 0 | std | 0.8460972 | 0.9696028 | 0.000 | 1.000 |
| 0 | SE | 0.0139722 | 0.0160117 | 0.000 | 1.000 |
| 0 | quantile_10 | -1.3165895 | -1.1748057 | 0.000 | 1.000 |
| 0 | quantile_25 | -1.0313223 | -0.8346280 | 0.000 | 1.000 |
| 0 | median | -0.4833589 | -0.1388095 | 0.000 | 1.000 |
| 0 | quantile_75 | 0.1752560 | 0.6228500 | 0.000 | 1.000 |
| 0 | quantile_90 | 0.8826601 | 1.3525285 | 0.000 | 1.000 |
| 10 | mean | -0.3462437 | -0.6666791 | 1.000 | 0.000 |
| 10 | std | 0.8460972 | 0.7852610 | 1.000 | 0.000 |
| 10 | SE | 0.0139722 | 0.0129676 | 1.000 | 0.000 |
| 10 | quantile_10 | -1.3165895 | -1.4969545 | 1.000 | 0.000 |
| 10 | quantile_25 | -1.0313223 | -1.1910799 | 1.000 | 0.000 |
| 10 | median | -0.4833589 | -0.8360661 | 1.000 | 0.000 |
| 10 | quantile_75 | 0.1752560 | -0.2587536 | 1.000 | 0.000 |
| 10 | quantile_90 | 0.8826601 | 0.4446643 | 1.000 | 0.000 |
| 30 | mean | -0.3462437 | -0.9078948 | 1.000 | 0.000 |
| 30 | std | 0.8460972 | 0.7084764 | 1.000 | 0.000 |
| 30 | SE | 0.0139722 | 0.0116996 | 1.000 | 0.000 |
| 30 | quantile_10 | -1.3165895 | -1.6739284 | 1.000 | 0.000 |
| 30 | quantile_25 | -1.0313223 | -1.4184819 | 1.000 | 0.000 |
| 30 | median | -0.4833589 | -1.1158471 | 1.000 | 0.000 |
| 30 | quantile_75 | 0.1752560 | -0.5674188 | 1.000 | 0.000 |
| 30 | quantile_90 | 0.8826601 | 0.0790274 | 1.000 | 0.000 |
| 5 | mean | -0.3462437 | -0.5227762 | 1.000 | 0.000 |
| 5 | std | 0.8460972 | 0.8177376 | 0.998 | 0.002 |
| 5 | SE | 0.0139722 | 0.0135039 | 0.998 | 0.002 |
| 5 | quantile_10 | -1.3165895 | -1.4726406 | 1.000 | 0.000 |
| 5 | quantile_25 | -1.0313223 | -1.1241070 | 1.000 | 0.000 |
| 5 | median | -0.4833589 | -0.6596520 | 1.000 | 0.000 |
| 5 | quantile_75 | 0.1752560 | -0.0266447 | 1.000 | 0.000 |
| 5 | quantile_90 | 0.8826601 | 0.5947087 | 1.000 | 0.000 |
| 90 | mean | -0.3462437 | -1.1136465 | 1.000 | 0.000 |
| 90 | std | 0.8460972 | 0.6357307 | 1.000 | 0.000 |
| 90 | SE | 0.0139722 | 0.0104983 | 1.000 | 0.000 |
| 90 | quantile_10 | -1.3165895 | -1.7732140 | 1.000 | 0.000 |
| 90 | quantile_25 | -1.0313223 | -1.5276425 | 1.000 | 0.000 |
| 90 | median | -0.4833589 | -1.1397679 | 1.000 | 0.000 |
| 90 | quantile_75 | 0.1752560 | -0.8367252 | 1.000 | 0.000 |
| 90 | quantile_90 | 0.8826601 | -0.2998549 | 1.000 | 0.000 |
Charleston Harbor
Simulate 1000 independent replicates (loci) for 1 - 36 segregating sites under neutral model for 18 individuals (36 observations) for no population expansion and for population expansion for alpha = 5, 10, 30, 90.
# simulated neutral loci
for i in {1..36}
do
scr/MS/ms 36 1000 -s $i | scr/MS/sample_stats > results/Ind18_Sites${i}_a0_Rep1000.stats
done
# simulated neutral loci with population expansion for alpha
for a in 5 10 30 90
do
for i in {1..36}
do
scr/MS/ms 36 1000 -s $i -G $a | scr/MS/sample_stats > results/Ind18_Sites${i}_a${a}_Rep1000.stats
done
done
Import results to compare null distribution of Tajima’s D for each locus according to number of segregating sites.
Create simulated data set based on empirical data set, i.e. determine number of loci with 1 … x segregating sites, then randomly draw that number of loci from simulated data set for each class of segregating loci for neutral loci w/no population growth.
Cumulative distribution of Tajima’s D for empirical and simulated data sets for a range of population expansion values.
Compare distributions of empirical and simulated data sets to determine if parameters describing the empirical distributions is significantly larger or smaller than the simulated null distribution.
Summary statistics of the distribution of Tajima’s D.
| ALPHA | PARAM | OBS | SIM | PVALsmaller | PVALlarger |
|---|---|---|---|---|---|
| 0 | mean | -0.3696008 | -0.0600008 | 0.000 | 1.000 |
| 0 | std | 0.8538342 | 0.9702727 | 0.000 | 1.000 |
| 0 | SE | 0.0140750 | 0.0159945 | 0.000 | 1.000 |
| 0 | quantile_10 | -1.3529714 | -1.2169483 | 0.000 | 1.000 |
| 0 | quantile_25 | -1.0847554 | -0.8240396 | 0.000 | 1.000 |
| 0 | median | -0.5129384 | -0.1662956 | 0.000 | 1.000 |
| 0 | quantile_75 | 0.2122607 | 0.6158444 | 0.000 | 1.000 |
| 0 | quantile_90 | 0.8667219 | 1.3503437 | 0.000 | 1.000 |
| 10 | mean | -0.3696008 | -0.6523428 | 1.000 | 0.000 |
| 10 | std | 0.8538342 | 0.7878707 | 1.000 | 0.000 |
| 10 | SE | 0.0140750 | 0.0129877 | 1.000 | 0.000 |
| 10 | quantile_10 | -1.3529714 | -1.4958990 | 1.000 | 0.000 |
| 10 | quantile_25 | -1.0847554 | -1.1833798 | 1.000 | 0.000 |
| 10 | median | -0.5129384 | -0.8141247 | 1.000 | 0.000 |
| 10 | quantile_75 | 0.2122607 | -0.2091295 | 1.000 | 0.000 |
| 10 | quantile_90 | 0.8667219 | 0.4451190 | 1.000 | 0.000 |
| 30 | mean | -0.3696008 | -0.8993336 | 1.000 | 0.000 |
| 30 | std | 0.8538342 | 0.7020571 | 1.000 | 0.000 |
| 30 | SE | 0.0140750 | 0.0115731 | 1.000 | 0.000 |
| 30 | quantile_10 | -1.3529714 | -1.6506044 | 1.000 | 0.000 |
| 30 | quantile_25 | -1.0847554 | -1.3928468 | 1.000 | 0.000 |
| 30 | median | -0.5129384 | -1.0944458 | 1.000 | 0.000 |
| 30 | quantile_75 | 0.2122607 | -0.5261850 | 1.000 | 0.000 |
| 30 | quantile_90 | 0.8667219 | 0.0357848 | 1.000 | 0.000 |
| 5 | mean | -0.3696008 | -0.5071613 | 1.000 | 0.000 |
| 5 | std | 0.8538342 | 0.8304972 | 0.989 | 0.011 |
| 5 | SE | 0.0140750 | 0.0136903 | 0.989 | 0.011 |
| 5 | quantile_10 | -1.3529714 | -1.4796425 | 1.000 | 0.000 |
| 5 | quantile_25 | -1.0847554 | -1.1332130 | 1.000 | 0.000 |
| 5 | median | -0.5129384 | -0.6749958 | 1.000 | 0.000 |
| 5 | quantile_75 | 0.2122607 | 0.0270473 | 1.000 | 0.000 |
| 5 | quantile_90 | 0.8667219 | 0.6706195 | 1.000 | 0.000 |
| 90 | mean | -0.3696008 | -1.0936692 | 1.000 | 0.000 |
| 90 | std | 0.8538342 | 0.6518576 | 1.000 | 0.000 |
| 90 | SE | 0.0140750 | 0.0107456 | 1.000 | 0.000 |
| 90 | quantile_10 | -1.3529714 | -1.7581458 | 1.000 | 0.000 |
| 90 | quantile_25 | -1.0847554 | -1.5065644 | 1.000 | 0.000 |
| 90 | median | -0.5129384 | -1.1380110 | 1.000 | 0.000 |
| 90 | quantile_75 | 0.2122607 | -0.8136294 | 1.000 | 0.000 |
| 90 | quantile_90 | 0.8667219 | -0.2310249 | 1.000 | 0.000 |
Winyah Bay
Simulate 1000 independent replicates (loci) for 1 - 36 segregating sites under neutral model for 18 individuals (36 observations) for no population expansion and for population expansion for alpha = 5, 10, 30, 90.
# simulated neutral loci
for i in {1..36}
do
scr/MS/ms 36 1000 -s $i | scr/MS/sample_stats > results/Ind18_Sites${i}_a0_Rep1000.stats
done
# simulated neutral loci with population expansion for alpha
for a in 5 10 30 90
do
for i in {1..36}
do
scr/MS/ms 36 1000 -s $i -G $a | scr/MS/sample_stats > results/Ind18_Sites${i}_a${a}_Rep1000.stats
done
done
Import results to compare null distribution of Tajima’s D for each locus according to number of segregating sites.
Create simulated data set based on empirical data set, i.e. determine number of loci with 1 … x segregating sites, then randomly draw that number of loci from simulated data set for each class of segregating loci for neutral loci w/no population growth.
Cumulative distribution of Tajima’s D for empirical and simulated data sets for a range of population expansion values.
Compare distributions of empirical and simulated data sets to determine if parameters describing the empirical distributions is significantly larger or smaller than the simulated null distribution.
Summary statistics of the distribution of Tajima’s D.
| ALPHA | PARAM | OBS | SIM | PVALsmaller | PVALlarger |
|---|---|---|---|---|---|
| 0 | mean | -0.3282842 | -0.0601800 | 0.000 | 1.000 |
| 0 | std | 0.8595211 | 0.9700697 | 0.000 | 1.000 |
| 0 | SE | 0.0143174 | 0.0161589 | 0.000 | 1.000 |
| 0 | quantile_10 | -1.3004334 | -1.2193681 | 0.000 | 1.000 |
| 0 | quantile_25 | -1.0355594 | -0.8234185 | 0.000 | 1.000 |
| 0 | median | -0.4826751 | -0.1647173 | 0.000 | 1.000 |
| 0 | quantile_75 | 0.2416686 | 0.6138551 | 0.000 | 1.000 |
| 0 | quantile_90 | 0.8841898 | 1.3471081 | 0.000 | 1.000 |
| 10 | mean | -0.3282842 | -0.6545048 | 1.000 | 0.000 |
| 10 | std | 0.8595211 | 0.7869050 | 1.000 | 0.000 |
| 10 | SE | 0.0143174 | 0.0131078 | 1.000 | 0.000 |
| 10 | quantile_10 | -1.3004334 | -1.4960315 | 1.000 | 0.000 |
| 10 | quantile_25 | -1.0355594 | -1.1904854 | 1.000 | 0.000 |
| 10 | median | -0.4826751 | -0.8143589 | 1.000 | 0.000 |
| 10 | quantile_75 | 0.2416686 | -0.2111951 | 1.000 | 0.000 |
| 10 | quantile_90 | 0.8841898 | 0.4382791 | 1.000 | 0.000 |
| 30 | mean | -0.3282842 | -0.9001954 | 1.000 | 0.000 |
| 30 | std | 0.8595211 | 0.7017828 | 1.000 | 0.000 |
| 30 | SE | 0.0143174 | 0.0116899 | 1.000 | 0.000 |
| 30 | quantile_10 | -1.3004334 | -1.6532718 | 1.000 | 0.000 |
| 30 | quantile_25 | -1.0355594 | -1.3954230 | 1.000 | 0.000 |
| 30 | median | -0.4826751 | -1.0942429 | 1.000 | 0.000 |
| 30 | quantile_75 | 0.2416686 | -0.5269851 | 1.000 | 0.000 |
| 30 | quantile_90 | 0.8841898 | 0.0346313 | 1.000 | 0.000 |
| 5 | mean | -0.3282842 | -0.5087689 | 1.000 | 0.000 |
| 5 | std | 0.8595211 | 0.8295508 | 0.998 | 0.002 |
| 5 | SE | 0.0143174 | 0.0138182 | 0.998 | 0.002 |
| 5 | quantile_10 | -1.3004334 | -1.4810724 | 1.000 | 0.000 |
| 5 | quantile_25 | -1.0355594 | -1.1332130 | 1.000 | 0.000 |
| 5 | median | -0.4826751 | -0.6754370 | 1.000 | 0.000 |
| 5 | quantile_75 | 0.2416686 | 0.0255215 | 1.000 | 0.000 |
| 5 | quantile_90 | 0.8841898 | 0.6658911 | 1.000 | 0.000 |
| 90 | mean | -0.3282842 | -1.0956612 | 1.000 | 0.000 |
| 90 | std | 0.8595211 | 0.6511864 | 1.000 | 0.000 |
| 90 | SE | 0.0143174 | 0.0108471 | 1.000 | 0.000 |
| 90 | quantile_10 | -1.3004334 | -1.7599900 | 1.000 | 0.000 |
| 90 | quantile_25 | -1.0355594 | -1.5099222 | 1.000 | 0.000 |
| 90 | median | -0.4826751 | -1.1404943 | 1.000 | 0.000 |
| 90 | quantile_75 | 0.2416686 | -0.8139015 | 1.000 | 0.000 |
| 90 | quantile_90 | 0.8841898 | -0.2314143 | 1.000 | 0.000 |
Calculates two p-values: (1) p-value assuming that \(D\) follows a normal distribution with mean zero and variance one (normal), (2) p-value assuming that \(D\) follows a beta distribution after rescaling on [0, 1] following Tajima 1989. Beta distribution is known to be overly conservative (i.e. fails to reject null hypothesis even when selection is occuring). The alternative is to simulate a null distribution for neutral loci using e.g. coalescent theory.
A total of 422 (12 after adjustment) loci were flagged as outlier in at least one estuary, 63 (5 after adjustment) loci were flagged as a positive outlier in at least one estuary, 361 (9 after adjustment) loci as negative outlier. A total of 167 (2 after adjustment) loci were flagged as both positive and negative outlier multiple estuaries.
Identify the number of loci that were flagged as negative and positive outlier in each estuary.
Number of loci flagged as significant positive or negative outlier per estuary.
| Estuary | NEGATIVE | POSITIVE |
|---|---|---|
| SanAntonioBay | 86 | 19 |
| SabineLake | 108 | 12 |
| BaratariaBay | 94 | 11 |
| WestMississippiSound | 93 | 14 |
| MobileBay | 89 | 13 |
| ApalachicolaBay | 80 | 19 |
| StJohnsRiver | 25 | 15 |
| CharlestonHarbor | 31 | 19 |
| WinyahBay | 20 | 21 |
Compare after adjusting per estuary.
Number of loci flagged as significant positive or negative outlier per estuary after adjusting by estuary.
| Estuary | NEGATIVE | POSITIVE |
|---|---|---|
| SanAntonioBay | 2 | 2 |
| SabineLake | 3 | 4 |
| BaratariaBay | 1 | 4 |
| WestMississippiSound | 4 | 4 |
| MobileBay | 4 | 4 |
| ApalachicolaBay | 1 | 4 |
| StJohnsRiver | 1 | NA |
| CharlestonHarbor | 1 | 1 |
| WinyahBay | NA | 1 |
Identify loci that have been previously mapped on the linkage map to identify patterns of clustering if present.
Distribution of Tajima’s D per locus calculated for individuals within each estuary across the genome. Loci flagged as significant outlier compared to the expected null distribution under equilibrium are indicated in red.
A total of 155 loci flagged as significant outlier in at least one estuary were located on the map.
\(\theta = 4N_{e}\mu\) is the population-scaled mutation rate, with \(\mu\) = mutation rate per bp per generation. Because of its relationship to the long-term effective population size it can be used as a proxy for \(N_{e}\) to understand the effects of drift. When \(N_{e}\) and \(\mu\) are unknown, \(\theta\) can be estimated on a locus-per-locus basis as \(\hat{\theta}_W\) using the number of segregating sites (observed nucleotide diversity \(S\)) or as \(\theta_{T}\) based on pairwise differences among haplotypes (nucleotide diversity \(\pi\)).
Watterson’s estimator (segregating sites)
Watterson’s estimator \(\theta_{W}\) (Watterson 1975Watterson, G. A. 1975. “On the number of segregating sites in genetical models without recombination.” Theoretical Population Biology 7 (2): 256–76. https://doi.org/10.1016/0040-5809(75)90020-9.) is an empirical way to measure genetic variation based on the number of segregating sites in a locus (observed nucleotide diversity). Despite the fact that estimates of \(\theta_{W}\) is biased for loci under selection and by population expansion it can be used as a proxy for relative differences in long-term effective population Ne size among populations even when a genome is not in equilibrium due to the theoretical relationship of \(\theta\) and Ne (\(\theta = 4N_{e}\mu\)).
Compare differences in distribution of \(\theta_{W}\) across all loci among estuaries.
Distribution of theta W per estuary.
Comparison of mean +/- standard deviation, minimum, and maximum values of theta W across loci among estuaries.
| pop | mean | std | max | min |
|---|---|---|---|---|
| SanAntonioBay | 0.4543470 | 0.3537153 | 2.696269 | 0 |
| SabineLake | 0.4814806 | 0.3642302 | 3.257992 | 0 |
| BaratariaBay | 0.5364753 | 0.3970729 | 3.595026 | 0 |
| WestMississippiSound | 0.5122540 | 0.3805040 | 3.257992 | 0 |
| MobileBay | 0.6024658 | 0.4263088 | 3.482681 | 0 |
| ApalachicolaBay | 0.5515221 | 0.4013758 | 2.920958 | 0 |
| StJohnsRiver | 0.3704835 | 0.3042725 | 2.022202 | 0 |
| CharlestonHarbor | 0.3596934 | 0.2960188 | 2.359236 | 0 |
| WinyahBay | 0.3531063 | 0.2952937 | 2.246891 | 0 |
Mean values of \(\theta_{W}\) are lower are lower in estuaries of the Atlantic compared to the Gulf of Mexico, implying that long-term effective population sizes are larger in Gulf estuaries compared to estuaries in the Atlantic.
Test for significant differences among ocean basins using a Mann-Whitney test (unpaired two-samples Wilcoxon test/Wilcoxon rank sum test).
##
## Wilcoxon rank sum test with continuity correction
##
## data: mean by OCEAN
## W = 0, p-value = 0.02819
## alternative hypothesis: true location shift is not equal to 0
Nucleotide diversity \(\pi\) (pairwise differences)
\(\theta_{T}\) (Nei 1987Nei, Masatoshi. 1987. Molecular Evolutionary Genetics. https://doi.org/10.7312/nei-92038.) is calculated as the sum of the number of differences between pairs of haplotypes of a given locus divided by the number of comparisons made. This parameter is baised towards allele segregating at intermediate rates.
Compare differences in distribution of \(\theta_{T}\) across all loci among estuaries.
Distribution of theta T across all loci among estuaries.
Comparison of mean +/- standard deviation, minimum, and maximum values of the nucleotide diversity (pi) across loci among estuaries.
| pop | mean | std | max | min |
|---|---|---|---|---|
| SanAntonioBay | 0.0021120 | 0.0019737 | 0.0147258 | 0 |
| SabineLake | 0.0021483 | 0.0019934 | 0.0151231 | 0 |
| BaratariaBay | 0.0021150 | 0.0019717 | 0.0170939 | 0 |
| WestMississippiSound | 0.0021052 | 0.0019699 | 0.0170587 | 0 |
| MobileBay | 0.0021120 | 0.0019506 | 0.0160608 | 0 |
| ApalachicolaBay | 0.0021141 | 0.0019686 | 0.0172357 | 0 |
| StJohnsRiver | 0.0020896 | 0.0019782 | 0.0151531 | 0 |
| CharlestonHarbor | 0.0020789 | 0.0019580 | 0.0155599 | 0 |
| WinyahBay | 0.0020789 | 0.0019930 | 0.0178757 | 0 |
Mean values of \(\theta_{T}\) are lower are marginally lower in estuaries of the Atlantic compared to the Gulf of Mexico, implying that long-term effective population sizes are larger in Gulf estuaries compared to estuaries in the Atlantic.
Test for significant differences among ocean basins using a Mann-Whitney test (unpaired two-samples Wilcoxon test/Wilcoxon rank sum test).
##
## Wilcoxon rank sum test with continuity correction
##
## data: mean by OCEAN
## W = 0, p-value = 0.02819
## alternative hypothesis: true location shift is not equal to 0
Redundancy analyses (RDA), as implemented in vegan (Oksanen et al. 2013Oksanen, J, F. G Blanchet, R Kindt, P Legendre, P. R Minchin, R. B O’Hara, G. L Simpson, et al. 2013. “Package ‘vegan’. Community ecology package.”), was used to assess the influence of geographic distance and of variable environmental parameters and their interaction on observed patterns of genomic variation among samples from the Gulf. RDA was not carried out among samples from the Atlantic because of limited geographic spread.
RDA is a constrained ordination method that extracts and summarizes components of variation in a multidimensional data set explained by a set of explanatory variables. Its purpose is to parse and visualize components of genomic variation (response variables) that are explained by geography and/or environment (explanatory variables) and to identify alleles/loci driving an observed pattern. While applying it in this way were multiple individuals from the same location are assigned the same values for spatial and environmental variables may result in some bias due to pseud-replication, not transferring the response variables to a sample level, for example using a correspondance analysis allows for the application of the RDA as an association approach when using genomic data that may not rely on equilibrium assumptions present in \(F_{ST}\)-based analyses (Forester et al. 2018Forester, Brenna R., Jesse R. Lasky, Helene H. Wagner, and Dean L. Urban. 2018. “Comparing methods for detecting multilocus adaptation with multivariate genotype–environment associations.” Molecular Ecology 27 (9): 2215–33. https://doi.org/10.1111/mec.14584.). The R2 value obtained is equivalent to the proportion of genomic variation explained by explanatory variables, allowing for a comparison of the relative importance of the variables and their interaction.
Response variables: Genotype data
Format response variables (genetic data) as allele counts per individual.
RDA requires a complete data set. Replace missing values with mean allele frequency, then extract allele counts per individual.
Constraining matrix: Coastal distance (spatial impacts)
Geographic distance was measured as approximate coastline distance. For coastal fish species coastal distance more appropriate parameterization of geography as relationship might not be linear.
Format parameters to create spatial matrix by calculating polynomials (10th degree) geographic distance to fit trend surface. To reduce effects of pseudo-replication, jitter distance.
Select best model for spatial variables (coastal distance) using forward model selection and permutation tests (999 permutations).
# run initial rda
rda.xy = rda(allelecount ~ ., data.frame(xy), scale = FALSE)
# forward selection of spatial variables
stp.xy = ordiR2step(rda(allelecount ~ 1, data.frame(xy)),
scope = formula(rda.xy),
R2scope = FALSE,
scale = FALSE,
Pin = 0.15,
direction="forward",
R2permutations = 999,
parallel = 45)
## Step: R2.adj= 0
## Call: allelecount ~ 1
##
## R2.adjusted
## + DIST_X3 0.00014093653
## + DIST_jitter 0.00006475115
## + DIST 0.00006432336
## + DIST_X1 0.00006432336
## <none> 0.00000000000
## + DIST_X2 -0.00002513934
##
## Df AIC F Pr(>F)
## + DIST_X3 1 1932.6 1.0331 0.02 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: R2.adj= 0.0001409365
## Call: allelecount ~ DIST_X3
##
## R2.adjusted
## + DIST_jitter 0.0002063814
## + DIST 0.0002061408
## + DIST_X1 0.0002061408
## <none> 0.0001409365
## + DIST_X2 0.0001162942
##
## Df AIC F Pr(>F)
## + DIST_jitter 1 1933.6 1.0153 0.186
selected_xy <- attributes(stp.xy$terms)$term.labels
Selected parameters for the best model is DIST_X3.
Constraining matrix: Environmental differences among estuaries
Data for 39 environmental parameters for each of the sample locations in the Gulf were obtained from the National Estuary Eutrophication Assessment database (NEAA 2018NEAA. 2018. “National Estuary Eutrophication Assessment.” http://ian.umces.edu/neea/siteinformation.php.). Parameters were then PCA-transformed, resulting in new synthetic variables summarizing environmental differences among estuaries. Forward selection of variables was performed on the synthetic variables to identify the best model of environmental variables and based on the results, the PC were used as the constraining matrix for the final RDA; significance was determined using 1,000 permutations.
Import environmental data from National Estuary Eutrophication Assessment to use as environmental predictors.
Parameters in data set include Estuary Area, Tidal Fresh Zone Area, Mixing Zone Area, Saltwater Zone Area, Estuary Volume, Estuary Depth, Estuary Perimeter, Tide Height, Tide Volume, Tide Volume/Day, Tide Ratio, Stratification Ratio, Percent Mixed Water, Percent Seawater, Average Salinity, Tidal Exchange, Daily Freshwater/Estuary Area, Daily Freshwater, Flow/Estuary Area, Total Freshwater Volume, Daily Precipitation, Daily Evaporation, Daily Precipitation/Estuary Area, Daily Evaporation/Estuary Area, Flow, Wind Speed, Sea Surface Temperature (mean), Sea Surface Temperature (std), Ocean Salinity (mean), Ocean Salinity (max), Ocean Salinity (min), Ocean Dissolved Inorganic Phosphorus, Oceanic NO3, Total Suspended Sediment, Total Nitrogen, Total Phosphorus, Total Suspended Sediment/Estuary Area, Total Nitrogen/Estuary Area, Total Phosphorus/Estuary Area, Wetland (km2) total, and Wetland (km2) percent.
Select environmental information for estuaries that have been sampled and perform PCA on scaled parameters.
Scree plot indicating percent variance summarized by each principle component.
PCA transform all significant environmental parameters and select model through forward selection using adjusted R2 as the stopping criterion.
## Step: R2.adj= 0
## Call: allelecount ~ 1
##
## R2.adjusted
## + PC8 0.00015395211
## + PC4 0.00011365471
## + PC6 0.00008836192
## + PC1 0.00007349204
## + PC3 0.00004668602
## + PC7 0.00001146825
## <none> 0.00000000000
## + PC2 -0.00004157001
## + PC5 -0.00010066119
## + PC9 -0.00010210653
##
## Df AIC F Pr(>F)
## + PC8 1 1932.6 1.0362 0.024 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: R2.adj= 0.0001539521
## Call: allelecount ~ PC8
##
## R2.adjusted
## + PC4 0.00025497145
## + PC1 0.00022665873
## + PC7 0.00019480392
## + PC3 0.00019328549
## + PC6 0.00017626019
## <none> 0.00015395211
## + PC2 0.00009336978
## + PC9 0.00005390553
## + PC5 0.00004739829
##
## Df AIC F Pr(>F)
## + PC4 1 1933.6 1.0236 0.076 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: R2.adj= 0.0002549714
## Call: allelecount ~ PC8 + PC4
##
## R2.adjusted
## + PC1 0.0003180708
## + PC6 0.0002782383
## + PC7 0.0002616803
## <none> 0.0002549714
## + PC3 0.0002520197
## + PC2 0.0001828849
## + PC9 0.0001697569
## + PC5 0.0001369291
##
## Df AIC F Pr(>F)
## + PC1 1 1934.5 1.0147 0.18
Selected parameters are PC8 and PC4.
Biplot of principal component analysis for selected PCs of environmental parameters for estuaries in the Western Gulf, central Gulf, and eastern Gulf.
Compare loadings on of variables on selected principle components
Loading of all environmental parameters on selected principle components, PC8 and PC4.
Variance partitioning was used to compare the contribution of geographic distance and variation in the environmental differences to explain the observed genomic variation. A full model, using selected geographic and environmental variables, a partial model, using geographic data conditioned on environmental variables, and a partial model, using environmental data conditioned on geographic data, were considered to partition the explainable variance into individual (geography or environment) and shared components (geography plus environment), using vegan. Significance of each component was tested using 1,000 permutations.
Partition variance to determine if geographic location or environmental data driving observed pattern(s).
Partition the explainable variance (i.e. total inertia for constrain matrix of full model): determine total, constrained, unconstrained values of inertia/proportion of total inertia (variance).
Full model
Perform RDA for all variables (spatial and environmental; full model).
## Call: rda(formula = allelecount ~ xy.mat + env.mat, scale = FALSE)
##
## Inertia Proportion Rank
## Total 3571.558 1.000
## Constrained 46.438 0.013 3
## Unconstrained 3525.120 0.987 232
## Inertia is variance
##
## Eigenvalues for constrained axes:
## RDA1 RDA2 RDA3
## 15.857 15.559 15.021
##
## Eigenvalues for unconstrained axes:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 20.686 20.650 20.539 20.358 20.311 20.142 20.105 20.037
## (Showing 8 of 232 unconstrained eigenvalues)
The proportion of variance explained by constrained PCA (rda) is 1.3%, adjusted R2 value is 0.0002391
Significance of overall model (costal distance + environment)
| Df | Variance | F | Pr(>F) | |
|---|---|---|---|---|
| Model | 3 | 46.43757 | 1.018738 | 0.023 |
| Residual | 232 | 3525.12011 | NA | NA |
Proportion of variance explained by each RDA axis.
| RDA1 | RDA2 | RDA3 | |
|---|---|---|---|
| Eigenvalue | 15.856806 | 15.559418 | 15.021348 |
| Proportion Explained | 0.341465 | 0.335061 | 0.323474 |
| Cumulative Proportion | 0.341465 | 0.676526 | 1.000000 |
Test for significance of each constrained axis.
| Df | Variance | F | Pr(>F) | |
|---|---|---|---|---|
| RDA1 | 1 | 15.85681 | 1.0435897 | 0.0509491 |
| RDA2 | 1 | 15.55942 | 1.0240176 | 0.2687313 |
| RDA3 | 1 | 15.02135 | 0.9886054 | 0.7572428 |
| Residual | 232 | 3525.12011 | NA | NA |
The overall model is significant (P = 0.023).
Partition the variation in genetic data into components accounted for by environmental and spatial variables and their combined effect using adjusted R squared to assess partitions explained by explanatory tables and their combinations.
# Partition the variance ====
vpart.all <- varpart(allelecount, ~ xy.mat, ~ env.mat)
vpart.all
##
## Partition of variance in RDA
##
## Call: varpart(Y = allelecount, X = ~xy.mat, ~env.mat)
##
## Explanatory tables:
## X1: ~xy.mat
## X2: ~env.mat
##
## No. of explanatory tables: 2
## Total variation (SS): 839316
## Variance: 3571.6
## No. of observations: 236
##
## Partition table:
## Df R.squared Adj.R.squared Testable
## [a+b] = X1 1 0.00440 0.00014 TRUE
## [b+c] = X2 2 0.00876 0.00025 TRUE
## [a+b+c] = X1+X2 3 0.01300 0.00024 TRUE
## Individual fractions
## [a] = X1|X2 1 -0.00002 TRUE
## [b] 0 0.00016 FALSE
## [c] = X2|X1 2 0.00010 TRUE
## [d] = Residuals 0.99976 FALSE
## ---
## Use function 'rda' to test significance of fractions of interest
Extract fraction explained by each component:
Test significance of all components using permuted p-values.
Variance partitioning of variance explained by coastal distance (xy), environmental variables (env), and shared effects due to correlation of xy and env.
| PARTITION | VARIANCE | PVAL |
|---|---|---|
| residuals | 0.9997609 | NA |
| env+shared | 0.0002550 | 0.005 |
| xy+env+shared | 0.0002391 | 0.029 |
| shared | 0.0001568 | NA |
| xy+shared | 0.0001409 | 0.020 |
| env | 0.0000982 | 0.200 |
| xy | -0.0000158 | 0.576 |
Compare distribution of individuals using weighted average individual scores. WA are weighted averages of allele scores that are as similar to linear combinations (LC) scores as possible. Weights are site (individual) totals of species (loci). To determine how well explanatory variables separate groups of individuals or if explanatory variables can be used to discriminate between groups of individuals, use wa-scores (or both).
Biplot of redundancy analysis using PCA-transformed environmental variables and third polynomial of coastal distance as explanatory variables (red arrows). Individuals sampled in the Gulf (colored circles) are plotted according to their component loadings calculated as weighted average scores for individuals sampled in the Gulf.
Compare distribution of individuals for environmental component only.
env <- gen_rda@strata %>%
select(LIB_ID, ESTUARY) %>%
mutate(ESTUARY = as.character(ESTUARY)) %>%
left_join(env_pca_all) %>%
select(LIB_ID, one_of(selected_env)) %>%
column_to_rownames("LIB_ID")
rda.env <- rda(allelecount ~ ., data = env, scale = FALSE)
# environmental variables
env_rda <- as.data.frame((rda.env$CCA$biplot)) %>%
rownames_to_column("ENV") %>%
mutate(RDA1_flip = -(RDA1),
RDA2_flip = -(RDA2))
# extract individual scores
ind_rda.env <- rda_indv(rda.env, 2) %>%
left_join(strata) %>%
mutate(ESTUARY = ordered(ESTUARY,
levels = c("SanAntonioBay", "MatagordaBay", "GalvestonBay", "SabineLake",
"BaratariaBay", "WestMississippiSound", "EastMississippiSound", "MobileBay",
"ApalachicolaBay")))
tmp <- ind_rda.env %>%
mutate(RDA1_WA_SCALED3_flip = -(RDA1_WA_SCALED3),
RDA2_WA_SCALED3_flip = -(RDA2_WA_SCALED3))
# plot
ggplot() +
geom_vline(xintercept = 0, color = "darkblue", linetype = "dashed") +
geom_hline(yintercept = 0, color = "darkblue", linetype = "dashed") +
geom_label_repel(data = env_rda, aes(x = RDA1_flip, y = RDA2_flip, label = ENV)) +
geom_segment(data = env_rda, aes(x = 0, y = 0, xend = RDA1_flip, yend = RDA2_flip),
arrow = arrow(length = unit(0.2, "cm")),
color="darkred", size = 1) +
geom_point(data = tmp, aes(x = RDA1_WA_SCALED3_flip, RDA2_WA_SCALED3_flip, fill = ESTUARY),
shape = 21, size = 3) +
scale_fill_manual(values = col_estuaries) +
labs(x = "RDA1", y = "RDA2") +
theme_standard +
theme(legend.position = "bottom")
Use the Mahalanobis distance to identify alleles with strongest associations to both RDA axes.
Allele loadings as (weighted) orthonormal scores for independent/unconstrained variables, alleles with Mahalanobis distance > 25 are indicated in red.
A total of 384 loci are flagged as significantly associated with the constraining variables (Mahalanobis distance > 25).
Distribution of Mahalanobis distance across the linkage map. For each locus allele with the highest Mahalanobis distance is plotted. Outlier loci are indicated in red.
Package versions used for this analysis (R 3.6.0).
## package loadedversion
## ade4 ade4 1.7-15
## adegenet adegenet 2.1.3
## ape ape 5.4-1
## assigner assigner 0.5.6
## broom broom 0.5.6
## coin coin 1.3-1
## doMC doMC 1.3.6
## doParallel doParallel 1.0.15
## doSNOW doSNOW 1.0.18
## dplyr dplyr 1.0.0
## forcats forcats 0.5.0
## foreach foreach 1.5.0
## ggplot2 ggplot2 3.3.2
## ggrepel ggrepel 0.8.2
## ggthemes ggthemes 4.2.0
## glue glue 1.4.1
## hierfstat hierfstat 0.5-04
## Imap Imap 1.32
## iterators iterators 1.0.12
## knitr knitr 1.29
## lattice lattice 0.20-41
## mapdata mapdata 2.3.0
## mapproj mapproj 1.2.7
## maps maps 3.3.0
## mmod mmod 1.3.3
## pegas pegas 0.13
## permute permute 0.9-5
## plotrix plotrix 3.7-8
## plyr plyr 1.8.6
## poppr poppr 2.8.6
## purrr purrr 0.3.4
## radiator radiator 1.1.1
## randomForestSRC randomForestSRC 2.9.3
## readr readr 1.3.1
## reshape2 reshape2 1.4.4
## rgdal rgdal 1.5-10
## snow snow 0.4-3
## sp sp 1.4-2
## stringr stringr 1.4.0
## survival survival 3.2-3
## tibble tibble 3.0.3
## tidyr tidyr 1.1.0
## tidyverse tidyverse 1.3.0
## tint tint 0.1.2.2
## tufte tufte 0.6.1
## UpSetR UpSetR 1.4.0
## vcfR vcfR 1.11.0
## vegan vegan 2.5-6
## viridis viridis 0.5.1
## viridisLite viridisLite 0.3.0